# Text Atomization Workflow

**Primary Layer: Daily Atomization Outputs**

This notebook walks through the recursive text atomization workflow for literary-computational scholarship: load a text, run the atomization pipeline, inspect the results, and export the daily outputs.

## Overview

Process literary texts through:
- N-gram frequency analysis
- Entropy measurements
- Glyph fusion mappings
- Recursive pattern detection

Outputs: a daily Markdown report and a consolidated JSON archive.

In [None]:
import json
import sys
from pathlib import Path

# Make the project's src/ directory importable.
# Assumes the notebook is run from a directory directly under the project root.
project_root = Path.cwd().parent
sys.path.insert(0, str(project_root / 'src'))

from atomization import TextAtomizer, NGramExtractor, EntropyAnalyzer, GlyphFusionMapper

## 1. Load Text Corpus

Load a literary text (for example the Odyssey, Metamorphoses, or Beowulf). If the corpus file is missing, the cell falls back to a short sample excerpt.

In [None]:
# Example: Load from file
corpus_path = project_root / 'data' / 'raw' / 'corpora' / 'odyssey_excerpt.txt'

# For demo, use sample text
sample_text = """
Sing to me of the man, Muse, the man of twists and turns
driven time and again off course, once he had plundered
the hallowed heights of Troy. Many cities of men he saw and learned their minds,
many pains he suffered, heartsick on the open sea,
fighting to save his life and bring his comrades home.
But he could not save them from disaster, hard as he strove—
the recklessness of their own ways destroyed them all,
the blind fools, they devoured the cattle of the Sun
and the Sungod wiped from sight the day of their return.
Launch out on his story, Muse, daughter of Zeus,
start from where you will—sing for our time too.
"""

# Try to load from file, fall back to sample
if corpus_path.exists():
    with open(corpus_path, 'r', encoding='utf-8') as f:
        text = f.read()
else:
    print(f"Corpus file not found at {corpus_path}")
    print("Using sample Odyssey excerpt...")
    text = sample_text

print(f"Loaded text: {len(text)} characters, {len(text.split())} words")

## 2. Initialize Text Atomizer

In [None]:
atomizer = TextAtomizer(
    text=text,
    title="Odyssey - Opening Invocation",
    metadata={
        'author': 'Homer',
        'translator': 'Robert Fagles',
        'genre': 'epic poetry',
        'tradition': 'Greek'
    }
)

print("Atomizer initialized")

## 3. Execute Atomization Pipeline

In [None]:
results = atomizer.atomize(
    ngram_range=(1, 3),  # unigrams through trigrams
    top_n=20,
    calculate_entropy=True,
    map_glyphs=True
)

print("Atomization complete!")
print(f"Analyzed {results['metadata']['word_count']} words")
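
The cells that follow read specific keys from `results`. As a rough guide, the sketch below builds a stand-in dictionary with the same keys and checks its shape. It is inferred only from how this notebook indexes `results`; the real return value of `atomizer.atomize()` may carry more fields. `mock_results`, `check_results_shape`, and all values are hypothetical.

```python
# Hypothetical stand-in for the dict returned by atomizer.atomize().
# The keys mirror those read in later cells; every value is invented.
mock_results = {
    "metadata": {"word_count": 3},
    "ngrams": {
        "1-grams": [{"text": "the", "frequency": 2}],
        "2-grams": [{"text": "the man", "frequency": 1}],
        "3-grams": [],
    },
    "entropy": {"shannon_entropy": 0.9183},
    "glyph_fusions": [{"pattern": "the man", "glyph": "*", "type": "bigram"}],
}


def check_results_shape(results):
    """Return a list of problems; an empty list means the shape matches."""
    problems = []
    for key in ("metadata", "ngrams", "entropy", "glyph_fusions"):
        if key not in results:
            problems.append(f"missing top-level key: {key}")
    if problems:
        return problems
    if "word_count" not in results["metadata"]:
        problems.append("metadata lacks 'word_count'")
    for label, items in results["ngrams"].items():
        for item in items:
            if not {"text", "frequency"} <= set(item):
                problems.append(f"{label} item lacks 'text' or 'frequency'")
    for name, value in results["entropy"].items():
        if not isinstance(value, (int, float)):
            problems.append(f"entropy metric {name!r} is not numeric")
    for fusion in results["glyph_fusions"]:
        if not {"pattern", "glyph", "type"} <= set(fusion):
            problems.append("glyph fusion lacks 'pattern', 'glyph', or 'type'")
    return problems


print(check_results_shape(mock_results))
```

Running `check_results_shape(results)` right after the pipeline cell would surface a missing key before the display cells fail with a `KeyError`.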

## 4. Explore N-gram Patterns

In [None]:
print("=" * 60)
print("TOP UNIGRAMS")
print("=" * 60)
for item in results['ngrams']['1-grams'][:10]:
    print(f"  {item['text']:20} → {item['frequency']:3} occurrences")

print("\n" + "=" * 60)
print("TOP BIGRAMS")
print("=" * 60)
for item in results['ngrams']['2-grams'][:10]:
    print(f"  {item['text']:30} → {item['frequency']:3} occurrences")

print("\n" + "=" * 60)
print("TOP TRIGRAMS")
print("=" * 60)
for item in results['ngrams']['3-grams'][:10]:
    print(f"  {item['text']:40} → {item['frequency']:3} occurrences")
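
For readers without the `atomization` package at hand, word n-gram counting can be sketched with the standard library. This is a minimal illustration, not the package's `NGramExtractor`: the tokenizer, the function name `top_ngrams`, and the example line are assumptions. Only the `text` and `frequency` keys are taken from the cell above.

```python
import re
from collections import Counter


def top_ngrams(text, n, top_n=10):
    """Count word n-grams and return the most frequent ones as dicts."""
    # Simplified tokenizer: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n shifted copies of the token list yields consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    counts = Counter(" ".join(gram) for gram in grams)
    return [{"text": t, "frequency": f} for t, f in counts.most_common(top_n)]


# Made-up line for illustration, not a quotation from the corpus.
line = "the man of twists and turns, the man of many pains"
for item in top_ngrams(line, n=2, top_n=3):
    print(f"  {item['text']:30} → {item['frequency']:3} occurrences")
```

This tokenizer drops punctuation and numerals. A real pipeline may treat hyphenated words, diacritics, and line breaks differently, which changes the counts.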

## 5. Entropy Analysis

In [None]:
import matplotlib.pyplot as plt

entropy_metrics = results['entropy']

print("ENTROPY METRICS")
print("=" * 60)
for metric, value in entropy_metrics.items():
    print(f"  {metric.replace('_', ' ').title():30} → {value:.4f}")

# Visualize
fig, ax = plt.subplots(figsize=(10, 6))
metrics = list(entropy_metrics.keys())
values = list(entropy_metrics.values())

ax.barh(metrics, values, color='steelblue')
ax.set_xlabel('Value')
ax.set_title('Text Complexity Metrics')
ax.grid(axis='x', alpha=0.3)
plt.tight_layout()
plt.show()
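
The archive cell later reads a `shannon_entropy` metric. As a reference point, Shannon entropy of a token distribution is H = -Σ p_i log2(p_i), where p_i is the relative frequency of token i. The sketch below computes it over words. Whether the package measures words, characters, or something else is not stated in this notebook, so treat the function as an assumption-laden illustration.

```python
import math
from collections import Counter


def shannon_entropy(tokens):
    """Shannon entropy, in bits, of the empirical token distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no tokens, so no uncertainty to measure
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


# Made-up token list for illustration.
words = "the man the sea the man".split()
print(f"{shannon_entropy(words):.4f} bits")
```

Higher values mean the tokens are spread more evenly across the vocabulary. A text that repeats a single word has entropy 0.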

## 6. Glyph Fusion Mappings

In [None]:
print("GLYPH FUSION MAPPINGS")
print("=" * 60)
for fusion in results['glyph_fusions'][:15]:
    print(f"  {fusion['pattern']:30} → {fusion['glyph']:3} ({fusion['type']})")

## 7. Export Daily Outputs

Generate the Markdown report and the JSON archive (Primary Layer format).

In [None]:
output_dir = project_root / 'data' / 'processed' / 'atomization'
output_dir.mkdir(parents=True, exist_ok=True)

md_path, json_path = atomizer.export_daily_output(output_dir)

print("Exported outputs:")
print(f"  Markdown: {md_path}")
print(f"  JSON:     {json_path}")
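
The later archive cell treats the JSON file as a list of runs, each with `date` and `entropy` keys. One way such a consolidated archive could be maintained is sketched below. `append_run` is a hypothetical helper, not the implementation behind `export_daily_output`, and the `title` field and the entropy value are placeholders.

```python
import json
import tempfile
from datetime import date
from pathlib import Path


def append_run(archive_path, run):
    """Append one run to the JSON list at archive_path and return the list."""
    archive_path = Path(archive_path)
    archive = []
    if archive_path.exists():
        archive = json.loads(archive_path.read_text(encoding="utf-8"))
    archive.append(run)
    archive_path.write_text(
        json.dumps(archive, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return archive


run = {
    "date": date.today().isoformat(),
    "title": "Odyssey - Opening Invocation",
    "entropy": {"shannon_entropy": 0.0},  # placeholder, not a real result
}

# Demonstrate in a temporary directory so no project files are touched.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "archive.json"
    append_run(demo_path, run)
    demo_archive = append_run(demo_path, run)

print(f"Archive contains {len(demo_archive)} atomization runs")
```

Rewriting the whole file on each append is simple and fine for small daily archives. A larger archive might be better served by one JSON object per line.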

## 8. Recursive Analysis Across Multiple Texts

Compare atomization results across the runs stored in the consolidated archive; each run can cover a different work.

In [None]:
# Load consolidated archive (Path() also covers the case where json_path is a plain string)
if Path(json_path).exists():
    with open(json_path, 'r', encoding='utf-8') as f:
        archive = json.load(f)
    
    print(f"Archive contains {len(archive)} atomization runs")
    
    # Compare entropy metrics across runs
    if len(archive) > 1:
        dates = [item['date'] for item in archive]
        shannon_entropies = [item['entropy']['shannon_entropy'] for item in archive]
        
        plt.figure(figsize=(12, 6))
        plt.plot(dates, shannon_entropies, marker='o')
        plt.xlabel('Date')
        plt.ylabel('Shannon Entropy')
        plt.title('Entropy Evolution Across Atomization Runs')
        plt.xticks(rotation=45)
        plt.grid(alpha=0.3)
        plt.tight_layout()
        plt.show()
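
To compare works rather than individual runs, the archive entries could be grouped before plotting. The sketch below averages Shannon entropy per work. It assumes each entry carries a `title` key, which this notebook does not confirm; the `date` and `entropy` keys come from the cell above, and the numbers are invented.

```python
from collections import defaultdict
from statistics import mean


def mean_entropy_by_title(archive):
    """Average the shannon_entropy of archive entries that share a title."""
    grouped = defaultdict(list)
    for item in archive:
        # 'title' is an assumed field; entries without it are pooled together.
        title = item.get("title", "untitled")
        grouped[title].append(item["entropy"]["shannon_entropy"])
    return {title: mean(values) for title, values in grouped.items()}


# Invented entries for illustration only.
toy_archive = [
    {"date": "2024-01-01", "title": "Work A", "entropy": {"shannon_entropy": 4.0}},
    {"date": "2024-01-02", "title": "Work A", "entropy": {"shannon_entropy": 5.0}},
    {"date": "2024-01-03", "title": "Work B", "entropy": {"shannon_entropy": 3.0}},
]

for title, value in mean_entropy_by_title(toy_archive).items():
    print(f"  {title:20} → {value:.4f}")
```

Entropy depends on excerpt length, so comparisons are more meaningful when the excerpts are of similar size.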

## Next Steps

1. **Iterate daily**: Run atomization on different text excerpts
2. **Compare works**: Analyze Odyssey vs. Metamorphoses vs. Beowulf
3. **Track patterns**: Monitor how entropy and n-grams evolve
4. **Feed to AI scholarship**: Use atomization results as input for Claude/GPT analysis