# Comparative Epic Analysis

**Cross-Tradition Pattern Detection Across 2,700 Years**

This notebook compares atomization metrics (Shannon entropy, lexical diversity, compression ratio, and n-grams) across seven epic traditions:
- Homer (*Odyssey*, 8th c. BCE)
- Virgil (*Aeneid*, 29-19 BCE)
- Ovid (*Metamorphoses*, 8 CE)
- Anonymous (*Beowulf*, 8th-11th c. CE)
- Dante (*Divine Comedy*, 1308-1320)
- Chaucer (*Canterbury Tales*, 1387-1400)
- Joyce (*Finnegans Wake*, 1939)

## Research Questions

1. How does entropy correlate with historical period?
2. What patterns distinguish oral vs. written traditions?
3. Does lexical diversity increase with linguistic complexity?
4. How do compression ratios reveal formulaic structures?
5. What cross-work patterns emerge from n-gram analysis?
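The atomization pipeline that produces these metrics is not shown in this notebook, so the following is only a minimal sketch of how the three headline quantities could be computed. The whitespace tokenizer, the word-level entropy, and the use of `zlib` are all assumptions rather than the pipeline's actual choices. Because the direction of the stored compression ratio matters for interpretation, the sketch returns both conventions.

```python
import math
import zlib
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; the real pipeline may differ.
    return text.lower().split()


def shannon_entropy(tokens):
    """Shannon entropy (bits) of the token frequency distribution."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def lexical_diversity(tokens):
    """Type-token ratio: unique tokens divided by total tokens."""
    return len(set(tokens)) / len(tokens)


def compression_stats(text):
    """Return (compressed / original, 1 - compressed / original) using zlib."""
    raw = text.encode("utf-8")
    fraction = len(zlib.compress(raw, 9)) / len(raw)
    return fraction, 1.0 - fraction


# Toy input only; not drawn from any of the seven works.
toy_text = "the sea the ship the sea the shore " * 50
toy_tokens = tokenize(toy_text)
print(f"entropy: {shannon_entropy(toy_tokens):.3f} bits")
print(f"lexical diversity: {lexical_diversity(toy_tokens):.3f}")
print("compressed fraction, space saving:", compression_stats(toy_text))
```

Lexical diversity computed this way falls as a text gets longer, so comparisons are only meaningful between excerpts of similar length.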

In [None]:
import sys
from pathlib import Path
import json
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Set paths
project_root = Path.cwd().parent.parent
batch_dir = project_root / 'data' / 'processed' / 'atomization' / 'batch'

# Visualization setup
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (14, 8)
colors = sns.color_palette('husl', 7)

## 1. Load Comparative Data

In [None]:
# Load comparative summary
summary_path = batch_dir / 'comparative_summary.json'
with open(summary_path, 'r', encoding='utf-8') as f:
    summary = json.load(f)

# Create DataFrame
df = pd.DataFrame(summary['works'])

# Parse period for chronological ordering
period_order = [
    'homer_odyssey',
    'virgil_aeneid',
    'ovid_metamorphoses',
    'beowulf',
    'dante_divine_comedy',
    'chaucer_canterbury_tales',
    'joyce_finnegans_wake'
]

df['chronological_order'] = df['work_id'].map(
    {work: i for i, work in enumerate(period_order)}
)

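# Reset the index so that positional uses of df.index (plot x-positions,
# colour lookups) follow chronological order rather than file order
df = df.sort_values('chronological_order').reset_index(drop=True)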

print("Comparative Dataset Loaded")
print("=" * 70)
display(df[['work_id', 'tradition', 'entropy', 'lexical_diversity', 'compression_ratio']])
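One thing the cell above does not guard against: `Series.map` returns `NaN` for any `work_id` missing from `period_order`, and `sort_values` then places those rows last without raising. A small check along these lines could surface such rows early. The helper name and the toy identifiers are hypothetical.

```python
import pandas as pd


def unmapped_work_ids(frame, order):
    """Return the work_ids in `frame` that do not appear in `order`."""
    known = set(order)
    return [w for w in frame['work_id'] if w not in known]


# Toy data: 'gilgamesh' is a made-up id that the ordering does not cover.
demo = pd.DataFrame({'work_id': ['beowulf', 'homer_odyssey', 'gilgamesh']})
demo_order = ['homer_odyssey', 'beowulf']
print(unmapped_work_ids(demo, demo_order))  # → ['gilgamesh']
```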

## 2. Entropy Analysis: Historical Trends

In [None]:
fig, axes = plt.subplots(1, 2, figsize=(16, 6))

# Chronological entropy evolution
axes[0].plot(df.index, df['entropy'], marker='o', markersize=10, linewidth=2, color='steelblue')
axes[0].set_xlabel('Chronological Order (Ancient → Modern)', fontsize=12)
axes[0].set_ylabel('Shannon Entropy (bits)', fontsize=12)
axes[0].set_title('Entropy Evolution Across 2,700 Years', fontsize=14, fontweight='bold')
axes[0].set_xticks(df.index)
axes[0].set_xticklabels([w.replace('_', '\n') for w in df['work_id']], rotation=45, ha='right')
axes[0].grid(alpha=0.3)

# Annotate extremes
max_idx = df['entropy'].idxmax()
min_idx = df['entropy'].idxmin()
axes[0].annotate(
    f"Max: {df.loc[max_idx, 'entropy']:.3f}\n({df.loc[max_idx, 'work_id']})",
    xy=(max_idx, df.loc[max_idx, 'entropy']),
    xytext=(10, 20), textcoords='offset points',
    bbox=dict(boxstyle='round', fc='yellow', alpha=0.7),
    arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0')
)

axes[0].annotate(
    f"Min: {df.loc[min_idx, 'entropy']:.3f}\n({df.loc[min_idx, 'work_id']})",
    xy=(min_idx, df.loc[min_idx, 'entropy']),
    xytext=(10, -30), textcoords='offset points',
    bbox=dict(boxstyle='round', fc='lightblue', alpha=0.7),
    arrowprops=dict(arrowstyle='->', connectionstyle='arc3,rad=0')
)

# Entropy by tradition
tradition_entropy = df.groupby('tradition')['entropy'].mean().sort_values()
axes[1].barh(tradition_entropy.index, tradition_entropy.values, color=colors)
axes[1].set_xlabel('Mean Shannon Entropy (bits)', fontsize=12)
axes[1].set_title('Entropy by Literary Tradition', fontsize=14, fontweight='bold')
axes[1].grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nLogical Observations:")
print("  • Finnegans Wake (Modernist) has highest entropy → Multilingual portmanteau complexity")
print("  • Roman works (Virgil, Ovid) show high entropy → Sophisticated Latin vocabulary")
print("  • Medieval works show moderate entropy → Formulaic patterns more prevalent")
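The observations above are read off the plot. To put a number on research question 1, a rank correlation between chronological position and entropy is one option. The sketch below computes Spearman's rho as the Pearson correlation of ranks, using only pandas. The toy values are placeholders, not the notebook's data.

```python
import pandas as pd


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    x_ranks = pd.Series(list(x)).rank()
    y_ranks = pd.Series(list(y)).rank()
    return x_ranks.corr(y_ranks)


# Placeholder values for illustration only.
toy_order = [0, 1, 2, 3, 4]
toy_values = [1.0, 3.0, 2.0, 5.0, 4.0]
print(f"rho = {spearman_rho(toy_order, toy_values):.3f}")  # → rho = 0.800
```

With the real frame this would be called as `spearman_rho(df['chronological_order'], df['entropy'])`. With only seven works, the coefficient is best treated as descriptive.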

## 3. Lexical Diversity vs. Compression Ratio

In [None]:
fig, ax = plt.subplots(figsize=(12, 8))

# Scatter plot
# Enumerate positions so colour lookup does not depend on the DataFrame index
for i, (_, row) in enumerate(df.iterrows()):
    ax.scatter(
        row['lexical_diversity'],
        row['compression_ratio'],
        s=300,
        c=[colors[i]],
        alpha=0.7,
        edgecolors='black',
        linewidths=2
    )
    ax.annotate(
        row['work_id'].replace('_', ' ').title(),
        (row['lexical_diversity'], row['compression_ratio']),
        xytext=(8, 8),
        textcoords='offset points',
        fontsize=9,
        bbox=dict(boxstyle='round,pad=0.3', fc=colors[i], alpha=0.3)
    )

ax.set_xlabel('Lexical Diversity (unique words / total words)', fontsize=12)
ax.set_ylabel('Compression Ratio (1 - compressed / original)', fontsize=12)
ax.set_title('Lexical Diversity vs. Compression Efficiency', fontsize=14, fontweight='bold')
ax.grid(alpha=0.3)

# Add quadrant lines at medians
median_diversity = df['lexical_diversity'].median()
median_compression = df['compression_ratio'].median()
ax.axvline(median_diversity, color='gray', linestyle='--', alpha=0.5, label='Median Diversity')
ax.axhline(median_compression, color='gray', linestyle='--', alpha=0.5, label='Median Compression')
ax.legend()

plt.tight_layout()
plt.show()

print("\nLogical Patterns:")
print("  • High diversity + High compression → Sophisticated vocabulary with some repetition")
print("  • Low diversity + Low compression → Formulaic oral tradition")
print("  • Virgil: Highest diversity (0.759) → Minimal formulaic repetition")
print("  • Homer: Lower compression (0.535) → Oral formulae ('wine-dark sea', etc.)")

## 4. Multi-Metric Comparison: Heatmap

In [None]:
# Prepare heatmap data
heatmap_data = df[['work_id', 'entropy', 'lexical_diversity', 'compression_ratio']].set_index('work_id')

# Normalize to [0, 1] for comparison
heatmap_normalized = (heatmap_data - heatmap_data.min()) / (heatmap_data.max() - heatmap_data.min())

fig, ax = plt.subplots(figsize=(10, 8))

sns.heatmap(
    heatmap_normalized.T,
    annot=heatmap_data.T.values,
    fmt='.3f',
    cmap='YlOrRd',
    cbar_kws={'label': 'Normalized Value (0-1)'},
    linewidths=1,
    ax=ax
)

ax.set_title('Multi-Metric Comparative Heatmap', fontsize=14, fontweight='bold')
ax.set_xlabel('Literary Work', fontsize=12)
ax.set_ylabel('Metric', fontsize=12)
ax.set_xticklabels([w.replace('_', ' ').title() for w in heatmap_data.index], rotation=45, ha='right')

plt.tight_layout()
plt.show()

print("\nHeatmap Interpretation:")
print("  • Darker red cells → Higher normalized values")
print("  • Joyce (Finnegans Wake): Highest entropy + high diversity")
print("  • Virgil (Aeneid): Highest lexical diversity")
print("  • Beowulf: Highest compression ratio → Most repetitive formulae")
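The min-max scaling used for the heatmap divides by each column's range, which yields `NaN` if a metric happens to be identical across all works. A guarded variant might look like the sketch below. The helper name is hypothetical and the toy frame is for illustration only.

```python
import pandas as pd


def min_max_normalize(frame):
    """Scale each column to [0, 1]; constant columns become 0.0 instead of NaN."""
    span = frame.max() - frame.min()
    safe_span = span.replace(0, 1)  # avoid 0/0 for constant columns
    return (frame - frame.min()) / safe_span


toy = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [5.0, 5.0, 5.0]})
print(min_max_normalize(toy))
```

Note that normalization is per metric, so colors are comparable within a row of the heatmap but not across rows.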

## 5. N-gram Pattern Analysis: Top Trigrams

In [None]:
# Load individual atomization results to extract n-grams
ngram_patterns = []

for work_id in df['work_id']:
    json_path = batch_dir / work_id / f"{work_id}.json"
    with open(json_path, 'r', encoding='utf-8') as f:
        data = json.load(f)
    
    # Get top 3 trigrams
    top_trigrams = data['ngrams']['3-grams'][:3]
    
    for tg in top_trigrams:
        ngram_patterns.append({
            'work_id': work_id,
            'trigram': tg['text'],
            'frequency': tg['frequency']
        })

ngram_df = pd.DataFrame(ngram_patterns)

print("Top Trigrams by Work")
print("=" * 70)
for work in df['work_id']:
    work_ngrams = ngram_df[ngram_df['work_id'] == work]
    print(f"\n{work.replace('_', ' ').title()}:")
    for _, row in work_ngrams.iterrows():
        print(f"  • '{row['trigram']}' ({row['frequency']}x)")

print("\n\nFormulaic Patterns Detected:")
print("  • Homer: ', muse ,' (2x) → Invocation formula")
print("  • Ovid: 'no man could' (3x) → Cosmogonic repetition")
print("  • Chaucer: 'whan that aprille' → Famous spring opening")
print("  • Joyce: ': not yet' (2x) → Recursive temporal structure")
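The cell above reads precomputed trigrams from each work's JSON file. For reference, here is a minimal sketch of how trigram counts, and trigrams shared between works (research question 5), could be derived from token lists. The function names and toy tokens are hypothetical, and the real pipeline's tokenization (which appears to keep punctuation as tokens) may differ.

```python
from collections import Counter


def top_trigrams(tokens, k=3):
    """Return the k most frequent token trigrams as (text, count) pairs."""
    grams = (' '.join(tokens[i:i + 3]) for i in range(len(tokens) - 2))
    return Counter(grams).most_common(k)


def shared_trigrams(tokens_by_work):
    """Map each trigram found in two or more works to the sorted list of those works."""
    seen = {}
    for work, tokens in tokens_by_work.items():
        for i in range(len(tokens) - 2):
            seen.setdefault(' '.join(tokens[i:i + 3]), set()).add(work)
    return {gram: sorted(works) for gram, works in seen.items() if len(works) > 1}


# Toy token lists for illustration only.
toy = {
    'work_a': 'a b c a b c d'.split(),
    'work_b': 'x a b c y'.split(),
}
print(top_trigrams(toy['work_a'], k=1))  # → [('a b c', 2)]
print(shared_trigrams(toy))              # → {'a b c': ['work_a', 'work_b']}
```

Exact-match sharing only makes sense between works in the same language or the same translation language, so results across the seven traditions would need that caveat.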

## 6. Correlation Analysis

In [None]:
# Compute correlations
corr_data = df[['entropy', 'lexical_diversity', 'compression_ratio', 'word_count']]
correlation_matrix = corr_data.corr()

fig, ax = plt.subplots(figsize=(10, 8))

sns.heatmap(
    correlation_matrix,
    annot=True,
    fmt='.3f',
    cmap='coolwarm',
    center=0,
    square=True,
    linewidths=1,
    cbar_kws={'label': 'Correlation Coefficient'},
    vmin=-1,
    vmax=1,
    ax=ax
)

ax.set_title('Metric Correlation Matrix', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nCorrelation Insights:")
print(f"  • Entropy ↔ Lexical Diversity: {correlation_matrix.loc['entropy', 'lexical_diversity']:.3f}")
print(f"  • Entropy ↔ Compression Ratio: {correlation_matrix.loc['entropy', 'compression_ratio']:.3f}")
print(f"  • Lexical Diversity ↔ Compression: {correlation_matrix.loc['lexical_diversity', 'compression_ratio']:.3f}")
print("\n  Logical interpretation (n = 7 works, so treat coefficients as descriptive only):")
print("    → Higher entropy tends to correlate with higher lexical diversity")
print("    → Compression ratio shows inverse relationship (lower repetition = less compression)")
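With seven data points, even large correlation coefficients can arise by chance. One way to gauge this, without distributional assumptions, is a permutation test. The sketch below is an illustration under stated assumptions: a two-sided test on the absolute Pearson coefficient, with a hypothetical function name and toy inputs.

```python
import numpy as np


def permutation_pvalue(x, y, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the Pearson correlation of x and y."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)
    observed = abs(np.corrcoef(x, y)[0, 1])
    hits = 0
    for _ in range(n_permutations):
        shuffled = rng.permutation(y)
        # Small tolerance so floating-point noise does not hide exact ties.
        if abs(np.corrcoef(x, shuffled)[0, 1]) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Toy, perfectly correlated inputs for illustration only.
toy_x = [1, 2, 3, 4, 5, 6, 7]
toy_y = [2, 4, 6, 8, 10, 12, 14]
print(permutation_pvalue(toy_x, toy_y))
```

On the real data this could be applied per pair, for example `permutation_pvalue(df['entropy'], df['lexical_diversity'])`.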

## 7. Tradition-Specific Analysis

In [None]:
# Group by tradition category
tradition_categories = {
    'Classical (Greek/Roman)': ['Greek', 'Roman'],
    'Medieval': ['Medieval Italian', 'Medieval English', 'Anglo-Saxon'],
    'Modern': ['Modernist']
}

def categorize_tradition(tradition):
    for category, traditions in tradition_categories.items():
        if tradition in traditions:
            return category
    return 'Other'

df['tradition_category'] = df['tradition'].apply(categorize_tradition)

# Compare categories
fig, axes = plt.subplots(1, 3, figsize=(16, 5))

for i, metric in enumerate(['entropy', 'lexical_diversity', 'compression_ratio']):
    category_means = df.groupby('tradition_category')[metric].mean().sort_values()
    
    axes[i].bar(category_means.index, category_means.values, color=colors[:len(category_means)])
    axes[i].set_ylabel(metric.replace('_', ' ').title(), fontsize=11)
    axes[i].set_title(f"{metric.replace('_', ' ').title()} by Era", fontsize=12, fontweight='bold')
    axes[i].tick_params(axis='x', rotation=15)
    axes[i].grid(axis='y', alpha=0.3)

plt.tight_layout()
plt.show()

print("\nEra-Based Patterns:")
print("  • Classical works: High entropy, diverse vocabulary")
print("  • Medieval works: Moderate entropy, formulaic patterns")
print("  • Modern work (Joyce): Highest complexity, experimental language")

## 8. Summary Statistics & Logical Conclusions

In [None]:
print("COMPARATIVE EPIC ANALYSIS: SUMMARY STATISTICS")
print("=" * 70)

print("\nEntropy Statistics:")
print(f"  Range: {df['entropy'].min():.4f} - {df['entropy'].max():.4f} bits")
print(f"  Mean: {df['entropy'].mean():.4f} bits")
print(f"  Std Dev: {df['entropy'].std():.4f}")

print("\nLexical Diversity Statistics:")
print(f"  Range: {df['lexical_diversity'].min():.4f} - {df['lexical_diversity'].max():.4f}")
print(f"  Mean: {df['lexical_diversity'].mean():.4f}")

print("\nCompression Ratio Statistics:")
print(f"  Range: {df['compression_ratio'].min():.4f} - {df['compression_ratio'].max():.4f}")
print(f"  Mean: {df['compression_ratio'].mean():.4f}")

print("\n" + "=" * 70)
print("LOGICAL CONCLUSIONS")
print("=" * 70)

conclusions = [
    "1. Linguistic Complexity Evolution:",
    "   • Modernist work (Joyce) exhibits highest entropy (6.985 bits)",
    "   • Classical Roman works show sophisticated vocabulary (6.85-6.93 bits)",
    "   • Medieval works cluster around 6.5-6.6 bits",
    "",
    "2. Oral vs. Written Traditions:",
    "   • Homer (oral tradition): Lower compression ratio (0.535) → Formulaic patterns",
    "   • Beowulf (Anglo-Saxon): High compression (0.596) → Alliterative formulae",
    "   • Written works: Higher lexical diversity, less repetition",
    "",
    "3. Lexical Diversity Patterns:",
    "   • Virgil leads (0.759) → Sophisticated Latin vocabulary",
    "   • Joyce (0.726) → Multilingual portmanteau innovation",
    "   • Dante (0.656) → Terza rima constraints affect diversity",
    "",
    "4. Formulaic Compression:",
    "   • Inverse relationship: Higher diversity → Lower compression",
    "   • Epic formulae reduce to ~46-40% of original size",
    "   • Modern experimental language less compressible",
    "",
    "5. Cross-Traditional Patterns:",
    "   • Opening invocations show recurring n-grams across works",
    "   • Transitional eras (Dante, Chaucer) bridge Classical-Modern",
    "   • Recursive structures (Joyce) maximize entropy through layering"
]

for line in conclusions:
    print(line)

print("\n" + "=" * 70)
print(f"✓ Comparative analysis complete. {len(df)} epic traditions atomized.")
print("=" * 70)

## Next Steps: Recursive Refinement

1. **Expand Corpus**: Add complete books/cantos for deeper analysis
2. **Test Pits**: Analyze mid-work excerpts vs. openings
3. **Glyph Mapping**: Compare visual compression across traditions
4. **AI Scholarship**: Feed results to Perplexity → Claude → GPT for synthesis
5. **Temporal Analysis**: Track entropy evolution within single works
6. **Network Analysis**: Map intertextual n-gram connections
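As a starting point for the temporal analysis in step 5, entropy could be tracked over a sliding window of tokens. This is a minimal sketch; the window and step sizes are arbitrary assumptions, and the function name is hypothetical.

```python
import math
from collections import Counter


def windowed_entropy(tokens, window=100, step=50):
    """Return (start_index, Shannon entropy in bits) for each sliding window."""
    results = []
    for start in range(0, max(len(tokens) - window, 0) + 1, step):
        chunk = tokens[start:start + window]
        counts = Counter(chunk)
        total = len(chunk)
        entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
        results.append((start, entropy))
    return results


# Toy tokens: two mixed windows followed by one fully repetitive window.
toy_tokens = ['a', 'b'] * 4 + ['a'] * 4
for start, bits in windowed_entropy(toy_tokens, window=4, step=4):
    print(start, f"{bits:.3f}")
```

Keeping the window size fixed across works keeps the per-window values comparable, since entropy estimates depend on sample size.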