# BabyLM-100M Dataset Sampling Pipeline

## Overview

This notebook provides sampling utilities for creating smaller, focused datasets from the BabyLM-100M corpus while preserving linguistic diversity.

### Dataset Components
- **BNC Spoken**: British National Corpus spoken data
- **CHILDES**: Child language transcripts  
- **Gutenberg**: Project Gutenberg literature
- **OpenSubtitles**: Movie/TV subtitles
- **SimpleWiki**: Simplified Wikipedia
- **Switchboard**: Telephone conversations

### Quick Start
```python
# Mixed dataset (1M words)
sample_proportions("mixed_1M", 1_000_000, 
                  bnc_spoken=0.2, childes=0.2, gutenberg=0.2, 
                  open_subtitles=0.2, simple_wiki=0.2)

# Literature sampling (50% of books)
sample_percentage_of_books(0.5, 1_000_000, "gutenberg_50pct")
```

## 1. Core Sampling Functions

**Sampling Strategy**: Uses 10,000-word snippets to preserve context while ensuring diverse coverage.

| Function | Purpose |
|----------|---------|
| `sample_from_single_file()` | Sample from individual corpus files |
| `sample_proportions()` | Create mixed datasets with specified proportions |
| `sample_percentage_of_books()` | Sample percentage of Gutenberg books |

In [2]:
import random
import os

SNIPPET_SIZE = 10_000

def sample_from_single_file(file_name, target_words, target_folder):
    with open(f"../datasets/BabyLM_dataset/train_100M/{file_name}.train", "r", encoding="utf-8") as f:
        words = f.read().split()
    total_words = len(words)
    num_snippets = int(1.2 * target_words) // SNIPPET_SIZE
    if (total_words > target_words + int(0.2 * target_words)):
        max_start = total_words - SNIPPET_SIZE
        starts = random.sample(range(max_start), num_snippets)

        snippets = [words[start:start + SNIPPET_SIZE] for start in starts]

        sampled_words = [word for snippet in snippets for word in snippet]

        train_words = sampled_words[:target_words]
        dev_words = sampled_words[target_words:]

        # Fix: Write only train words to main file
        with open(f"../datasets/BabyLM_dataset/{target_folder}/{file_name}.train", "w+", encoding="utf-8") as f:
            f.write(" ".join(train_words))  # Changed from sampled_words to train_words
        with open(f"../datasets/BabyLM_dataset/{target_folder}/{file_name}_dev.train", "w+", encoding="utf-8") as f:
            f.write(" ".join(dev_words))
    else: 
        print(f"File {file_name} has only {total_words} words, not enough to sample {target_words} words.")



In [3]:

def sample_proportions (output_name, no_words, bnc_spoken, childes, gutenberg, open_subtitles, simple_wiki, switchboard):
    # if bnc_spoken + childes + gutenberg + open_subtitles + simple_wiki + switchboard != 1:
    #     raise ValueError("Proportions must sum to 1.")

    files = ["bnc_spoken.train", "childes.train", "gutenberg.train", "open_subtitles.train", "simple_wiki.train", "switchboard.train"]
    proportions = [bnc_spoken, childes, gutenberg, open_subtitles, simple_wiki, switchboard]

    train_words = []
    dev_words = []
    
    for i in range(len(files)):
        file = files[i]
        total_words_needed = int((1.2 * no_words) * proportions[i])
        train_words_needed = int(no_words * proportions[i])
        dev_words_needed = total_words_needed - train_words_needed
        
        if total_words_needed == 0:
            continue
            
        with open(f"../datasets/BabyLM_dataset/train_100M/{file}", "r", encoding="utf-8") as f:
            words = f.read().split()
        total_words = len(words)

        if total_words < total_words_needed:
            print(f"File {file} has only {total_words} words, not enough to sample {total_words_needed} words.")
            continue

        # Calculate snippets needed for this file specificallyNUM_SNIPnum_snippetsPETS
        snippets_needed = (total_words_needed + SNIPPET_SIZE - 1) // SNIPPET_SIZE  # Ceiling division
        max_start = total_words - SNIPPET_SIZE
        starts = random.sample(range(max_start), min(snippets_needed, max_start))
        snippets = [words[start:start + SNIPPET_SIZE] for start in starts]
        file_words = [word for snippet in snippets for word in snippet][:total_words_needed]
        
        # Split this file's words into train and dev proportionally
        file_train_words = file_words[:train_words_needed]
        file_dev_words = file_words[train_words_needed:train_words_needed + dev_words_needed]
        
        train_words.extend(file_train_words)
        dev_words.extend(file_dev_words)
    
    print(f"Train words: {len(train_words)}")
    print(f"Dev words: {len(dev_words)}")
    with open(f"../datasets/BabyLM_dataset/{output_name}.train", "w+", encoding="utf-8") as f:
        f.write(" ".join(train_words))
    with open(f"../datasets/BabyLM_dataset/{output_name}_dev.train", "w+", encoding="utf-8") as f:
        f.write(" ".join(dev_words))


In [None]:
def sample_percentage_of_books(percentage, target_words, filename):
    """Sample a percentage of books from the Gutenberg corpus
    
    Args:
        percentage: Fraction of books to sample (0.0 to 1.0)
        target_words: Number of words for training set
        filename: Output filename prefix
    """
    with open(f"../datasets/BabyLM_dataset/train_100M/gutenberg.train", "r", encoding="utf-8") as f:
        all_books = f.read()
    lines = all_books.split("\n")

    # Find book boundaries
    beginning_indices = []
    for i in range(len(lines)):
        if lines[i].startswith("= = = "):
            beginning_indices.append(i)
    
    target_with_dev = int(1.2 * target_words)
    
    # Sample the specified percentage of books
    text = ""
    total_words = 0
    books_used = 0
    
    for i in range(int(len(beginning_indices) * percentage)):
        if i + 1 >= len(beginning_indices):
            break
            
        number_of_lines = beginning_indices[i + 1] - beginning_indices[i] - 1
        last_index = beginning_indices[i] + int(percentage * number_of_lines)
        book_text = " ".join(lines[(beginning_indices[i] + 1):last_index])
        total_words += len(book_text.split())
        text += book_text + " "
        books_used += 1
        
        if total_words >= target_with_dev:
            print(f"Sampled {total_words} words from {i + 1} books.")
            text_words = text.split()
            train_text = " ".join(text_words[:target_words])
            print(f"Train text length: {len(train_text.split())} words.")
            dev_text = " ".join(text_words[target_words:target_with_dev])
            print(f"Dev text length: {len(dev_text.split())} words.")
            break

    with open(f"../datasets/BabyLM_dataset/books_context/{filename}.train", "w", encoding="utf-8") as f:
        f.write(train_text)
    with open(f"../datasets/BabyLM_dataset/books_context/{filename}_dev.train", "w", encoding="utf-8") as f:
        f.write(dev_text)

# Example usage (uncomment to run):
# sample_percentage_of_books(0.25, 1_000_000, "gutenberg_1M_25pct_books")
# sample_percentage_of_books(0.5, 1_000_000, "gutenberg_1M_50pct_books")
# sample_percentage_of_books(0.75, 1_000_000, "gutenberg_1M_75pct_books")
# sample_percentage_of_books(1.0, 1_000_000, "gutenberg_1M_100pct_books")

Sampled 1212910 words from 107 books.
Train text length: 1000000 words.
Dev text length: 200000 words.


FileNotFoundError: [Errno 2] No such file or directory: '../datasets/BabyLM_dataset/books_context/gutenberg_1M_25pct_books.train'

## 2. Book Sampling & Analysis

In [None]:
def analyze_book_dataset():
    """Analyze the Gutenberg book dataset statistics"""
    print("Analyzing Gutenberg book dataset...")
    
    with open(f"../datasets/BabyLM_dataset/train_100M/gutenberg.train", "r", encoding="utf-8") as f:
        all_books = f.read()
    
    lines = all_books.split("\n")
    
    # Find book boundaries
    beginning_indices = []
    for i in range(len(lines)):
        if lines[i].startswith("= = = "):
            beginning_indices.append(i)
    
    # Add the end of the file as the last boundary
    beginning_indices.append(len(lines))
    
    total_books = len(beginning_indices) - 1  # Subtract 1 because we added the end index
    print(f"Total number of books in dataset: {total_books}")
    
    # Calculate word count for each book
    book_word_counts = []
    total_words_all_books = 0
    
    for i in range(total_books):
        start_line = beginning_indices[i] + 1  # Skip the header line
        end_line = beginning_indices[i + 1]
        
        book_text = " ".join(lines[start_line:end_line])
        word_count = len(book_text.split())
        book_word_counts.append(word_count)
        total_words_all_books += word_count
    
    # Calculate statistics
    average_words_per_book = total_words_all_books / total_books
    print(f"Average words per book: {average_words_per_book:.0f}")
    print(f"Total words in all books: {total_words_all_books:,}")
    
    # Calculate number of books for each percentage
    percentages = [0.25, 0.5, 0.75, 1.0]
    for pct in percentages:
        num_books = int(total_books * pct)
        print(f"\n{pct*100:.0f}% sampling uses approximately {num_books} books")
        
        # Calculate actual words available
        words_used = 0
        for i in range(num_books):
            start_line = beginning_indices[i] + 1
            end_line = beginning_indices[i + 1]
            number_of_lines = end_line - start_line
            lines_to_use = int(pct * number_of_lines)
            book_text = " ".join(lines[start_line:(start_line + lines_to_use)])
            words_used += len(book_text.split())
        
        print(f"  Total words available: {words_used:,}")
    
    return {
        'total_books': total_books,
        'average_words_per_book': average_words_per_book,
        'total_words': total_words_all_books,
        'book_word_counts': book_word_counts
    }

# Run the analysis
stats = analyze_book_dataset()

Analyzing Gutenberg book dataset...
Total number of books in dataset: 563
Average words per book: 46834
Total words in all books: 26,367,293
25% sampling uses approximately 140 books
  - Total words available from 25% sampling: 1,863,965
50% sampling uses approximately 281 books
  - Total words available from 50% sampling: 6,926,689
75% sampling uses approximately 422 books
  - Total words available from 75% sampling: 15,171,320
100% sampling uses approximately 563 books
  - Total words available from 100% sampling: 26,367,293


## 3. Quantifier Analysis

Analysis of quantifier usage patterns across different text genres.

In [None]:
import re
import os
from collections import defaultdict

def count_quantifiers_in_gutenberg_genres():
    """Count occurrences of quantifiers in all Gutenberg genre datasets"""
    
    quantifiers = [
        # Basic universal quantifiers
        "all", "every", "each", "everyone", "everybody", "everything", "everywhere",
        # Existential quantifiers
        "some", "any", "someone", "somebody", "something", "somewhere", 
        "anyone", "anybody", "anything", "anywhere",
        # Negative quantifiers
        "no", "none", "nothing", "nobody", "no one", "nowhere", "neither", "never",
        # Numerical/proportional quantifiers
        "most", "many", "much", "few", "a few", "little", "a little", "several",
        "numerous", "countless",
        # Old English and literary quantifiers
        "many a", "such", "divers", "sundry", "manifold", "myriad", "legion",
        # Distributive quantifiers
        "either", "both", "half", "whole", "entire",
        # Indefinite quantifiers
        "certain", "various", "different", "other", "another", "others",
        "majority", "minority", "bulk", "rest",
        # Archaic forms
        "nary", "naught", "aught", "ought", "whit",
        # Frequency quantifiers
        "always", "often", "sometimes", "rarely", "seldom", "once", "twice", "thrice",
        # Common collective terms
        "couple", "dozen", "score", "hundred", "thousand", "multitude", "host"
    ]
    
    genres_dir = "../datasets/gutenberg/genres"
    train_files = [f for f in os.listdir(genres_dir) if f.endswith('.train') and '_dev.train' not in f]
    
    print(f"Analyzing {len(train_files)} genre datasets...")
    
    # Map file names to genres
    genre_mapping = {
        'mystery': 'Mystery',
        'romance': 'Romance', 
        'sci-fi_fantasy': 'Sci-Fi/Fantasy',
        'self_help_non_fiction': 'Self-Help/Non-Fiction',
        'youth_and_ya_gutenberg': 'Youth/YA',
        'old_english_drama_poetry': 'Old English Drama/Poetry'
    }
    
    file_counts = {}
    total_counts = defaultdict(int)
    genre_counts = defaultdict(int)
    
    for file_name in train_files:
        file_path = os.path.join(genres_dir, file_name)
        
        # Determine genre from filename
        genre = None
        for key, value in genre_mapping.items():
            if key in file_name:
                genre = value
                break
        if not genre:
            genre = "Unknown"
        
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read().lower()
        except Exception as e:
            print(f"Error reading {file_name}: {e}")
            continue
        
        file_counts[file_name] = {}
        file_total = 0
        
        for quantifier in quantifiers:
            if ' ' in quantifier:
                pattern = r'\b' + re.escape(quantifier.lower()) + r'\b'
            else:
                pattern = r'\b' + re.escape(quantifier.lower()) + r'\b'
            
            matches = re.findall(pattern, text)
            count = len(matches)
            
            file_counts[file_name][quantifier] = count
            total_counts[quantifier] += count
            file_total += count
        
        genre_counts[genre] += file_total
        print(f"  {file_name}: {file_total:,} quantifiers ({genre})")
    
    print(f"\nResults by Genre:")
    print("-" * 40)
    for genre, total_count in sorted(genre_counts.items(), key=lambda x: x[1], reverse=True):
        print(f"{genre:<25}: {total_count:>8,}")
    
    grand_total = sum(total_counts.values())
    print(f"\nGrand Total: {grand_total:,}")
    
    return {
        'total_counts': dict(total_counts),
        'file_counts': file_counts,
        'genre_counts': dict(genre_counts),
        'grand_total': grand_total
    }

# Run the analysis
print("Counting quantifiers in Gutenberg genre datasets...")
results = count_quantifiers_in_gutenberg_genres()

Counting quantifiers in Gutenberg genre datasets...
Found 6 dataset files:
  - 1M_mystery_books.train
  - 1M_old_english_drama_poetry.train
  - 1M_sci-fi_fantasy.train
  - 1M_self_help_non_fiction.train
  - 1M_romance.train
  - 1M_youth_and_ya_gutenberg.train


Processing 1M_mystery_books.train...
  Total quantifiers found: 39,282 (Genre: Mystery)
Processing 1M_old_english_drama_poetry.train...
  Total quantifiers found: 39,282 (Genre: Mystery)
Processing 1M_old_english_drama_poetry.train...
  Total quantifiers found: 32,558 (Genre: Old English Drama/Poetry)
Processing 1M_sci-fi_fantasy.train...
  Total quantifiers found: 32,558 (Genre: Old English Drama/Poetry)
Processing 1M_sci-fi_fantasy.train...
  Total quantifiers found: 42,419 (Genre: Sci-Fi/Fantasy)
Processing 1M_self_help_non_fiction.train...
  Total quantifiers found: 42,419 (Genre: Sci-Fi/Fantasy)
Processing 1M_self_help_non_fiction.train...
  Total quantifiers found: 43,521 (Genre: Self-Help/Non-Fiction)
Processing 1M_romanc

### 3.2 Frequency Analysis by Genre

In [None]:
def analyze_quantifier_frequencies_by_genre():
    """Detailed frequency analysis of quantifiers across genres"""
    
    quantifiers = [
        "all", "every", "each", "everyone", "everybody", "everything", "everywhere",
        "some", "any", "someone", "somebody", "something", "somewhere", 
        "anyone", "anybody", "anything", "anywhere",
        "no", "none", "nothing", "nobody", "no one", "nowhere", "neither", "never",
        "most", "many", "much", "few", "a few", "little", "a little", "several",
        "numerous", "countless", "many a", "such", "divers", "sundry", "manifold", 
        "myriad", "legion", "either", "both", "half", "whole", "entire",
        "certain", "various", "different", "other", "another", "others",
        "majority", "minority", "bulk", "rest", "nary", "naught", "aught", "ought", "whit",
        "always", "often", "sometimes", "rarely", "seldom", "once", "twice", "thrice",
        "couple", "dozen", "score", "hundred", "thousand", "multitude", "host"
    ]
    
    genres_dir = "../datasets/gutenberg/genres"
    train_files = [f for f in os.listdir(genres_dir) if f.endswith('.train') and '_dev.train' not in f]
    
    genre_mapping = {
        'mystery': 'Mystery',
        'romance': 'Romance', 
        'sci-fi_fantasy': 'Sci-Fi/Fantasy',
        'self_help_non_fiction': 'Self-Help/Non-Fiction',
        'youth_and_ya_gutenberg': 'Youth/YA',
        'old_english_drama_poetry': 'Old English Drama/Poetry'
    }
    
    print("QUANTIFIER FREQUENCY ANALYSIS BY GENRE")
    print("="*80)
    
    quantifier_by_genre = defaultdict(lambda: defaultdict(int))
    genre_word_counts = {}
    genre_quantifier_totals = defaultdict(int)
    
    # Process each file
    for file_name in train_files:
        file_path = os.path.join(genres_dir, file_name)
        
        # Determine genre
        genre = None
        for key, value in genre_mapping.items():
            if key in file_name:
                genre = value
                break
        if not genre:
            genre = f"Unknown ({file_name})"
        
        print(f"Processing {file_name} -> {genre}...")
        
        try:
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read().lower()
        except Exception as e:
            print(f"  ERROR: {e}")
            continue
        
        total_words = len(text.split())
        if genre not in genre_word_counts:
            genre_word_counts[genre] = 0
        genre_word_counts[genre] += total_words
        
        # Count quantifiers
        file_quantifier_total = 0
        for quantifier in quantifiers:
            if ' ' in quantifier:
                pattern = r'\b' + re.escape(quantifier.lower()) + r'\b'
            else:
                pattern = r'\b' + re.escape(quantifier.lower()) + r'\b'
            
            matches = re.findall(pattern, text)
            count = len(matches)
            quantifier_by_genre[genre][quantifier] += count
            file_quantifier_total += count
        
        genre_quantifier_totals[genre] += file_quantifier_total
        density = (file_quantifier_total/total_words)*1000
        print(f"  → {total_words:,} words, {file_quantifier_total:,} quantifiers ({density:.2f}/1000)")
    
    # Summary results
    print("\n" + "="*60)
    print("QUANTIFIER DENSITY BY GENRE")
    print("="*60)
    
    for genre in sorted(genre_quantifier_totals.keys()):
        total_words = genre_word_counts[genre]
        total_quantifiers = genre_quantifier_totals[genre]
        density = (total_quantifiers / total_words) * 1000
        print(f"{genre:<25}: {density:>6.2f} per 1000 words ({total_quantifiers:,} total)")
    
    # Top quantifiers across all genres
    all_quantifier_totals = defaultdict(int)
    for genre_data in quantifier_by_genre.values():
        for quantifier, count in genre_data.items():
            all_quantifier_totals[quantifier] += count
    
    print(f"\nTOP 15 QUANTIFIERS OVERALL:")
    print("-" * 40)
    sorted_quantifiers = sorted(all_quantifier_totals.items(), key=lambda x: x[1], reverse=True)
    for i, (quantifier, total_count) in enumerate(sorted_quantifiers[:15], 1):
        print(f"{i:2d}. {quantifier:<15}: {total_count:>6,}")
    
    return {
        'quantifier_by_genre': dict(quantifier_by_genre),
        'genre_word_counts': genre_word_counts,
        'genre_quantifier_totals': dict(genre_quantifier_totals),
        'all_quantifier_totals': dict(all_quantifier_totals)
    }

# Run the detailed frequency analysis
print("Starting quantifier frequency analysis...")
quantifier_freq_results = analyze_quantifier_frequencies_by_genre()

Starting detailed quantifier frequency analysis...
QUANTIFIER FREQUENCY ANALYSIS BY GENRE
Found 6 genre files:
  - 1M_mystery_books.train
  - 1M_old_english_drama_poetry.train
  - 1M_sci-fi_fantasy.train
  - 1M_self_help_non_fiction.train
  - 1M_romance.train
  - 1M_youth_and_ya_gutenberg.train

Processing 1M_mystery_books.train -> Mystery...
  → Words in this file: 1,000,116
  → Quantifiers in this file: 39,282
  → Quantifier density: 39.28 per 1000 words

Processing 1M_old_english_drama_poetry.train -> Old English Drama/Poetry...
  → Words in this file: 1,000,116
  → Quantifiers in this file: 39,282
  → Quantifier density: 39.28 per 1000 words

Processing 1M_old_english_drama_poetry.train -> Old English Drama/Poetry...
  → Words in this file: 1,000,251
  → Quantifiers in this file: 32,558
  → Quantifier density: 32.55 per 1000 words

Processing 1M_sci-fi_fantasy.train -> Sci-Fi/Fantasy...
  → Words in this file: 1,000,251
  → Quantifiers in this file: 32,558
  → Quantifier density: 3

In [9]:
import math
from collections import Counter

def analyze_quantifier_variety_and_diversity():
    """
    Comprehensive analysis of quantifier variety and diversity across genres.
    This will tell us if Old English Drama/Poetry has more diverse quantifier usage.
    """
    print("QUANTIFIER VARIETY & DIVERSITY ANALYSIS")
    print("="*80)
    
    if 'quantifier_freq_results' not in locals() and 'quantifier_freq_results' not in globals():
        print("ERROR: quantifier_freq_results not found. Please run the frequency analysis first.")
        return
    
    results = quantifier_freq_results
    
    # Data structures for analysis
    genre_variety_stats = {}
    
    for genre in sorted(results['quantifier_by_genre'].keys()):
        genre_data = results['quantifier_by_genre'][genre]
        total_words = results['genre_word_counts'][genre]
        
        print(f"\nAnalyzing {genre}...")
        print("-" * 50)
        
        # Filter out quantifiers with 0 counts
        used_quantifiers = {q: count for q, count in genre_data.items() if count > 0}
        
        # 1. BASIC VARIETY METRICS
        unique_quantifiers = len(used_quantifiers)
        total_quantifier_instances = sum(used_quantifiers.values())
        
        print(f"Basic Variety:")
        print(f"  • Unique quantifiers used: {unique_quantifiers}")
        print(f"  • Total quantifier instances: {total_quantifier_instances:,}")
        print(f"  • Average uses per unique quantifier: {total_quantifier_instances/unique_quantifiers:.1f}")
        
        # 2. SHANNON DIVERSITY INDEX
        # H = -Σ(p_i * ln(p_i)) where p_i is proportion of species i
        proportions = [count/total_quantifier_instances for count in used_quantifiers.values()]
        shannon_diversity = -sum(p * math.log(p) for p in proportions if p > 0)
        shannon_evenness = shannon_diversity / math.log(unique_quantifiers) if unique_quantifiers > 1 else 0
        
        print(f"Shannon Diversity:")
        print(f"  • Shannon Index (H): {shannon_diversity:.3f}")
        print(f"  • Shannon Evenness (J): {shannon_evenness:.3f}")
        print(f"  • Max possible H for {unique_quantifiers} quantifiers: {math.log(unique_quantifiers):.3f}")
        
        # 3. SIMPSON DIVERSITY INDEX
        # D = 1 - Σ(p_i^2) - probability that two randomly selected are different
        simpson_diversity = 1 - sum(p**2 for p in proportions)
        simpson_reciprocal = 1 / sum(p**2 for p in proportions) if sum(p**2 for p in proportions) > 0 else 0
        
        print(f"Simpson Diversity:")
        print(f"  • Simpson Index (1-D): {simpson_diversity:.3f}")
        print(f"  • Simpson Reciprocal: {simpson_reciprocal:.1f}")
        
        # 4. DISTRIBUTION ANALYSIS
        counts_list = list(used_quantifiers.values())
        counts_list.sort(reverse=True)
        
        # Gini coefficient (measure of inequality)
        n = len(counts_list)
        gini = (2 * sum((i + 1) * count for i, count in enumerate(sorted(counts_list)))) / (n * sum(counts_list)) - (n + 1) / n
        
        print(f"Distribution Analysis:")
        print(f"  • Most used quantifier: {max(counts_list)} times")
        print(f"  • Least used quantifier: {min(counts_list)} times")
        print(f"  • Median usage: {sorted(counts_list)[len(counts_list)//2]}")
        print(f"  • Gini coefficient: {gini:.3f} (0=perfectly equal, 1=maximally unequal)")
        
        # 5. RARE QUANTIFIERS (used ≤ 10 times)
        rare_quantifiers = {q: count for q, count in used_quantifiers.items() if count <= 10}
        common_quantifiers = {q: count for q, count in used_quantifiers.items() if count > 100}
        
        print(f"Rare vs Common Quantifiers:")
        print(f"  • Rare quantifiers (≤10 uses): {len(rare_quantifiers)} ({len(rare_quantifiers)/unique_quantifiers*100:.1f}%)")
        print(f"  • Common quantifiers (>100 uses): {len(common_quantifiers)} ({len(common_quantifiers)/unique_quantifiers*100:.1f}%)")
        
        # 6. ARCHAIC QUANTIFIER VARIETY
        archaic_quantifiers = ["ought", "aught", "naught", "whit", "nary", "divers", "sundry", "manifold", "many a", "myriad", "legion", "thrice", "manifold"]
        archaic_used = {q: used_quantifiers.get(q, 0) for q in archaic_quantifiers if used_quantifiers.get(q, 0) > 0}
        
        print(f"Archaic Quantifier Variety:")
        print(f"  • Archaic quantifiers used: {len(archaic_used)}")
        print(f"  • Total archaic instances: {sum(archaic_used.values())}")
        if archaic_used:
            print(f"  • Most used archaic: {max(archaic_used, key=archaic_used.get)} ({archaic_used[max(archaic_used, key=archaic_used.get)]} times)")
        
        # Store results for comparison
        genre_variety_stats[genre] = {
            'unique_quantifiers': unique_quantifiers,
            'total_instances': total_quantifier_instances,
            'shannon_diversity': shannon_diversity,
            'shannon_evenness': shannon_evenness,
            'simpson_diversity': simpson_diversity,
            'simpson_reciprocal': simpson_reciprocal,
            'gini_coefficient': gini,
            'rare_quantifiers': len(rare_quantifiers),
            'rare_percentage': len(rare_quantifiers)/unique_quantifiers*100,
            'common_quantifiers': len(common_quantifiers),
            'archaic_variety': len(archaic_used),
            'archaic_instances': sum(archaic_used.values())
        }
    
    # COMPARATIVE ANALYSIS
    print("\n" + "="*80)
    print("COMPARATIVE VARIETY ANALYSIS")
    print("="*80)
    
    # Sort genres by different diversity metrics
    metrics = [
        ('unique_quantifiers', 'Number of Unique Quantifiers', False),
        ('shannon_diversity', 'Shannon Diversity Index', False),
        ('simpson_diversity', 'Simpson Diversity Index', False),
        ('rare_percentage', 'Percentage of Rare Quantifiers', False),
        ('archaic_variety', 'Number of Archaic Quantifiers Used', False),
        ('gini_coefficient', 'Gini Coefficient (Inequality)', True),  # True means lower is better for diversity
    ]
    
    for metric_key, metric_name, reverse_better in metrics:
        print(f"\n{metric_name}:")
        print("-" * 60)
        
        sorted_genres = sorted(genre_variety_stats.items(), 
                             key=lambda x: x[1][metric_key], 
                             reverse=not reverse_better)
        
        for i, (genre, stats) in enumerate(sorted_genres, 1):
            value = stats[metric_key]
            if metric_key == 'rare_percentage':
                print(f"{i}. {genre:<25}: {value:>6.1f}%")
            elif 'diversity' in metric_key or 'gini' in metric_key:
                print(f"{i}. {genre:<25}: {value:>6.3f}")
            else:
                print(f"{i}. {genre:<25}: {value:>6}")
    
    # SPECIFIC FOCUS ON OLD ENGLISH DRAMA/POETRY
    print("\n" + "="*80)
    print("FOCUS: OLD ENGLISH DRAMA/POETRY vs OTHER GENRES")
    print("="*80)
    
    old_english = "Old English Drama/Poetry"
    if old_english in genre_variety_stats:
        old_stats = genre_variety_stats[old_english]
        
        print(f"\n{old_english} Diversity Profile:")
        print("-" * 50)
        
        # Calculate rankings
        rankings = {}
        for metric_key, metric_name, reverse_better in metrics:
            sorted_genres = sorted(genre_variety_stats.items(), 
                                 key=lambda x: x[1][metric_key], 
                                 reverse=not reverse_better)
            for i, (genre, stats) in enumerate(sorted_genres, 1):
                if genre == old_english:
                    rankings[metric_key] = i
                    break
        
        for metric_key, metric_name, reverse_better in metrics:
            rank = rankings[metric_key]
            value = old_stats[metric_key]
            total_genres = len(genre_variety_stats)
            
            if metric_key == 'rare_percentage':
                print(f"• {metric_name}: {value:.1f}% (Rank {rank}/{total_genres})")
            elif 'diversity' in metric_key or 'gini' in metric_key:
                print(f"• {metric_name}: {value:.3f} (Rank {rank}/{total_genres})")
            else:
                print(f"• {metric_name}: {value} (Rank {rank}/{total_genres})")
        
        # Specific insights
        print(f"\nKey Insights:")
        print("-" * 20)
        
        if rankings['unique_quantifiers'] == 1:
            print("✓ OLD ENGLISH has the HIGHEST quantifier variety!")
        elif rankings['unique_quantifiers'] <= 2:
            print("• OLD ENGLISH has very high quantifier variety (top 2)")
        else:
            print(f"• OLD ENGLISH ranks #{rankings['unique_quantifiers']} in quantifier variety")
        
        if rankings['shannon_diversity'] == 1:
            print("✓ OLD ENGLISH has the most BALANCED quantifier usage!")
        elif rankings['shannon_diversity'] <= 2:
            print("• OLD ENGLISH has very balanced quantifier usage (top 2)")
            
        if rankings['archaic_variety'] == 1:
            print("✓ OLD ENGLISH uses the most ARCHAIC quantifiers!")
        elif rankings['archaic_variety'] <= 2:
            print("• OLD ENGLISH has high archaic quantifier usage (top 2)")
    
    return genre_variety_stats

# Run the comprehensive variety analysis
print("Starting comprehensive quantifier variety analysis...")
variety_results = analyze_quantifier_variety_and_diversity()

Starting comprehensive quantifier variety analysis...
QUANTIFIER VARIETY & DIVERSITY ANALYSIS

Analyzing Mystery...
--------------------------------------------------
Basic Variety:
  • Unique quantifiers used: 102
  • Total quantifier instances: 39,282
  • Average uses per unique quantifier: 385.1
Shannon Diversity:
  • Shannon Index (H): 3.700
  • Shannon Evenness (J): 0.800
  • Max possible H for 102 quantifiers: 4.625
Simpson Diversity:
  • Simpson Index (1-D): 0.962
  • Simpson Reciprocal: 26.2
Distribution Analysis:
  • Most used quantifier: 3689 times
  • Least used quantifier: 1 times
  • Median usage: 132
  • Gini coefficient: 0.700 (0=perfectly equal, 1=maximally unequal)
Rare vs Common Quantifiers:
  • Rare quantifiers (≤10 uses): 17 (16.7%)
  • Common quantifiers (>100 uses): 54 (52.9%)
Archaic Quantifier Variety:
  • Archaic quantifiers used: 11
  • Total archaic instances: 152
  • Most used archaic: ought (97 times)

Analyzing Old English Drama/Poetry...
-----------------

### 3.3 Diversity Analysis

Analysis using Shannon and Simpson diversity indices to measure quantifier variety across genres.

In [15]:
# PARADOX ANALYSIS: Why does Old English Drama/Poetry train better models despite lower variety?
import math

print("🔍 PARADOX ANALYSIS: OLD ENGLISH DRAMA/POETRY MODEL PERFORMANCE")
print("="*80)

print("THE PUZZLE:")
print("-" * 20)
print("• Analysis shows Old English Drama/Poetry ranks LAST in quantifier variety")
print("• Yet models trained on it OUTPERFORM others on quantifier tasks")
print("• This suggests variety ≠ learnability. Let's investigate why...")

print("\n" + "="*80)
print("HYPOTHESIS TESTING: What makes Old English Drama/Poetry special for learning?")
print("="*80)

if 'quantifier_freq_results' in locals() and 'variety_results' in locals():
    old_english = "Old English Drama/Poetry"
    results = quantifier_freq_results
    variety_stats = variety_results
    
    old_data = results['quantifier_by_genre'][old_english]
    old_variety = variety_stats[old_english]
    old_words = results['genre_word_counts'][old_english]
    
    # HYPOTHESIS 1: Quality over Quantity - Distinct Usage Patterns
    print("\n1. HYPOTHESIS: DISTINCTIVE USAGE PATTERNS")
    print("-" * 50)
    
    print("Old English may have fewer quantifiers but use them in more distinctive ways:")
    
    # Find quantifiers where Old English usage is distinctly different
    distinctive_patterns = []
    
    for quantifier, count in old_data.items():
        if count > 0:
            old_freq = (count / old_words) * 1000
            
            # Calculate usage in other genres combined
            other_total = sum(results['quantifier_by_genre'][g].get(quantifier, 0) 
                             for g in results['quantifier_by_genre'].keys() if g != old_english)
            other_words = sum(results['genre_word_counts'][g] 
                             for g in results['genre_word_counts'].keys() if g != old_english)
            other_freq = (other_total / other_words) * 1000 if other_words > 0 else 0
            
            if other_freq > 0:
                ratio = old_freq / other_freq
                if ratio > 1.5 or ratio < 0.7:  # Significantly different usage
                    distinctive_patterns.append((quantifier, count, old_freq, other_freq, ratio))
    
    distinctive_patterns.sort(key=lambda x: abs(math.log(x[4])), reverse=True)
    
    print(f"Quantifiers with DISTINCTIVE usage patterns in Old English ({len(distinctive_patterns)} found):")
    for quantifier, count, old_freq, other_freq, ratio in distinctive_patterns[:10]:
        direction = "MORE" if ratio > 1 else "LESS"
        print(f"  • {quantifier:<15}: {ratio:>5.2f}x {direction} than others ({old_freq:>5.2f} vs {other_freq:>5.2f}/1000)")
    
    # HYPOTHESIS 2: Structural Consistency - Lower Variance in Usage
    print("\n2. HYPOTHESIS: CONSISTENT STRUCTURAL PATTERNS")
    print("-" * 50)
    
    print("Old English might have more consistent quantifier usage patterns:")
    print(f"• Gini coefficient (inequality): {old_variety['gini_coefficient']:.3f}")
    print(f"• Shannon evenness: {old_variety['shannon_evenness']:.3f}")
    
    # Compare with other genres
    gini_comparison = []
    evenness_comparison = []
    for genre, stats in variety_stats.items():
        gini_comparison.append((genre, stats['gini_coefficient']))
        evenness_comparison.append((genre, stats['shannon_evenness']))
    
    gini_comparison.sort(key=lambda x: x[1])  # Lower Gini = more equal distribution
    evenness_comparison.sort(key=lambda x: x[1], reverse=True)  # Higher evenness = better
    
    print(f"\nGini coefficient rankings (lower = more consistent):")
    for i, (genre, gini) in enumerate(gini_comparison, 1):
        mark = "📍" if genre == old_english else "  "
        print(f"{mark} {i}. {genre:<25}: {gini:.3f}")
    
    print(f"\nShannon evenness rankings (higher = more balanced):")
    for i, (genre, evenness) in enumerate(evenness_comparison, 1):
        mark = "📍" if genre == old_english else "  "
        print(f"{mark} {i}. {genre:<25}: {evenness:.3f}")
    
    # HYPOTHESIS 3: Archaic/Formal Quantifiers = Better Grammatical Structure
    print("\n3. HYPOTHESIS: ARCHAIC QUANTIFIERS = BETTER GRAMMAR")
    print("-" * 50)
    
    archaic_quantifiers = ["ought", "aught", "naught", "whit", "nary", "divers", "sundry", "manifold", "many a", "myriad", "legion", "thrice"]
    
    print("Old English uses archaic quantifiers that might teach better grammar:")
    
    old_archaic_usage = {}
    for q in archaic_quantifiers:
        if q in old_data and old_data[q] > 0:
            old_archaic_usage[q] = old_data[q]
    
    if old_archaic_usage:
        print("Archaic quantifiers used in Old English:")
        for q, count in sorted(old_archaic_usage.items(), key=lambda x: x[1], reverse=True):
            freq = (count / old_words) * 1000
            print(f"  • {q:<15}: {count:>4} times ({freq:>5.2f}/1000 words)")
    
    # HYPOTHESIS 4: Context Quality - Richer Syntactic Contexts
    print("\n4. HYPOTHESIS: RICHER SYNTACTIC CONTEXTS")
    print("-" * 50)
    
    print("Old English Drama/Poetry may provide richer contexts for quantifiers:")
    avg_usage = old_variety['total_instances'] / old_variety['unique_quantifiers']
    print(f"• Average examples per quantifier: {avg_usage:.1f}")
    
    context_richness = []
    for genre, stats in variety_stats.items():
        avg = stats['total_instances'] / stats['unique_quantifiers'] if stats['unique_quantifiers'] > 0 else 0
        context_richness.append((genre, avg))
    
    context_richness.sort(key=lambda x: x[1], reverse=True)
    
    print("\nContext richness rankings (higher = more examples per quantifier):")
    for i, (genre, avg) in enumerate(context_richness, 1):
        mark = "📍" if genre == old_english else "  "
        print(f"{mark} {i}. {genre:<25}: {avg:>6.1f} examples per unique quantifier")
    
    # HYPOTHESIS 5: Training Efficiency - Fewer but Better Examples
    print("\n5. HYPOTHESIS: TRAINING EFFICIENCY")
    print("-" * 50)
    
    print("Old English might be more efficient for learning quantifiers:")
    density = (old_variety['total_instances']/old_words)*1000
    efficiency = density / old_variety['unique_quantifiers']
    print(f"• Quantifier density: {density:.2f} per 1000 words")
    print(f"• Unique quantifiers: {old_variety['unique_quantifiers']}")
    print(f"• Efficiency ratio (density/variety): {efficiency:.3f}")
    
    efficiency_scores = []
    for genre, stats in variety_stats.items():
        genre_words = results['genre_word_counts'][genre]
        density = (stats['total_instances'] / genre_words) * 1000
        efficiency = density / stats['unique_quantifiers']
        efficiency_scores.append((genre, efficiency, density, stats['unique_quantifiers']))
    
    efficiency_scores.sort(key=lambda x: x[1], reverse=True)
    
    print(f"\nTraining efficiency rankings (higher = better density-to-variety ratio):")
    for i, (genre, eff, density, variety) in enumerate(efficiency_scores, 1):
        mark = "📍" if genre == old_english else "  "
        print(f"{mark} {i}. {genre:<25}: {eff:.3f} (density: {density:.2f}, variety: {variety})")
    
    # SUMMARY OF FINDINGS
    print("\n" + "="*80)
    print("🎯 POTENTIAL EXPLANATIONS FOR BETTER MODEL PERFORMANCE")
    print("="*80)
    
    # Calculate Old English's rankings (FIXED: handle tuple structure properly)
    old_gini_rank = next(i for i, (g, _) in enumerate(gini_comparison, 1) if g == old_english)
    old_evenness_rank = next(i for i, (g, _) in enumerate(evenness_comparison, 1) if g == old_english)
    old_context_rank = next(i for i, (g, _) in enumerate(context_richness, 1) if g == old_english)
    old_efficiency_rank = next(i for i, item in enumerate(efficiency_scores, 1) if item[0] == old_english)
    
    explanations = []
    
    if old_gini_rank <= 2:
        explanations.append(f"✓ More CONSISTENT quantifier usage (Gini rank: #{old_gini_rank}/6)")
    
    if old_evenness_rank <= 2:
        explanations.append(f"✓ More BALANCED quantifier distribution (Evenness rank: #{old_evenness_rank}/6)")
    
    if old_context_rank <= 2:
        explanations.append(f"✓ RICHER contexts per quantifier (Context rank: #{old_context_rank}/6)")
    
    if old_efficiency_rank <= 2:
        explanations.append(f"✓ More EFFICIENT training signal (Efficiency rank: #{old_efficiency_rank}/6)")
    
    if len(distinctive_patterns) >= 15:
        explanations.append(f"✓ {len(distinctive_patterns)} quantifiers with DISTINCTIVE usage patterns")
    
    if old_variety['archaic_variety'] >= 8:
        explanations.append(f"✓ Uses {old_variety['archaic_variety']} ARCHAIC quantifiers (formal grammar)")
    
    print("Old English Drama/Poetry may outperform because it has:")
    for explanation in explanations:
        print(explanation)
    
    if not explanations:
        print("🤔 The basic metrics don't strongly explain the performance difference.")
        print("   This suggests more subtle factors may be at play:")
        print("   • Quality of syntactic contexts")
        print("   • Co-occurrence with other grammatical structures") 
        print("   • Sentence-level distribution patterns")
    
    print(f"\n📊 KEY INSIGHT:")
    print(f"While Old English has the LEAST quantifier variety ({old_variety['unique_quantifiers']} unique),")
    print(f"it may provide HIGHER QUALITY training examples through:")
    print(f"  • More consistent and predictable usage patterns")
    print(f"  • Distinctive grammatical contexts that generalize well")
    print(f"  • Better signal-to-noise ratio for quantifier learning")
    print(f"  • Archaic forms that encode explicit grammatical relationships")

else:
    print("ERROR: Missing analysis results. Please run the quantifier analyses first.")

print(f"\n{'='*80}")
print("CONCLUSION: Quantity ≠ Quality in Language Model Training!")

🔍 PARADOX ANALYSIS: OLD ENGLISH DRAMA/POETRY MODEL PERFORMANCE
THE PUZZLE:
--------------------
• Analysis shows Old English Drama/Poetry ranks LAST in quantifier variety
• Yet models trained on it OUTPERFORM others on quantifier tasks
• This suggests variety ≠ learnability. Let's investigate why...

HYPOTHESIS TESTING: What makes Old English Drama/Poetry special for learning?

1. HYPOTHESIS: DISTINCTIVE USAGE PATTERNS
--------------------------------------------------
Old English may have fewer quantifiers but use them in more distinctive ways:
Quantifiers with DISTINCTIVE usage patterns in Old English (71 found):
  • anybody        :  0.01x LESS than others ( 0.00 vs  0.09/1000)
  • someone        :  0.03x LESS than others ( 0.00 vs  0.04/1000)
  • everybody      :  0.04x LESS than others ( 0.00 vs  0.11/1000)
  • around         :  0.04x LESS than others ( 0.01 vs  0.23/1000)
  • aught          : 14.46x MORE than others ( 0.08 vs  0.01/1000)
  • thrice         : 12.04x MORE than othe

### 3.4 Performance Paradox Analysis

Investigation of why **Old English Drama/Poetry** produces superior language models despite having the lowest quantifier variety. The analysis suggests that training data **quality** (consistency, distinctive patterns, rich contexts) may be more important than raw linguistic variety for model performance.