# Text Processing Development - Sample Data

**Purpose:** Develop and test a text processing pipeline on sample files from `eda/samples/`

**Scope:**
- Extract clean text from SRAF-XML-wrapper format
- Implement contextual chunking strategies
- **Test 12 chunk sizes** (200, 300, 400, 500, 750, 1000, 1500, 2000, 3000, 4000, 5000, 8000 tokens)
- Preserve document structure and metadata
- Determine optimal chunk size empirically (expected: 500-1000 tokens per FinGPT research)

**Data Source:** 1,375 sample files from `eda/samples/` (1993-2024 SEC filings)

**Output:** Production-ready `text_processor.py` for `src/prod/data/`

---

## 1. Setup & Dependencies

In [1]:
import sys
from pathlib import Path
import re
from collections import defaultdict
import json

# Add project root to path
project_root = Path.cwd().parent.parent
sys.path.insert(0, str(project_root))

# Sample data location
SAMPLES_DIR = project_root / 'eda' / 'samples'
print(f"[INFO] Sample directory: {SAMPLES_DIR}")
print(f"[INFO] Sample files available: {len(list(SAMPLES_DIR.glob('*.txt')))}")

[INFO] Sample directory: C:\Users\kabec\Documents\edgar_anomaly_detection\eda\samples
[INFO] Sample files available: 1375


## 2. Text Extraction from SRAF-XML-wrapper

### 2.1 Load Sample Filing

In [2]:
# Load a sample 10-K for testing
sample_files = list(SAMPLES_DIR.glob('*10-K*.txt'))
if sample_files:
    sample_path = sample_files[0]
    print(f"[INFO] Testing with: {sample_path.name}")
    
    with open(sample_path, 'r', encoding='utf-8') as f:
        raw_content = f.read()
    
    print(f"[OK] Loaded {len(raw_content):,} characters")
    print(f"\n[Preview] First 1000 chars:\n{raw_content[:1000]}")
else:
    print("[FAIL] No 10-K samples found")

[INFO] Testing with: 19931129_10-K_edgar_data_861439_0000912057-94-000263.txt
[OK] Loaded 212,900 characters

[Preview] First 1000 chars:
<Header>
<FileStats>
    <FileName>19931129_10-K_edgar_data_861439_0000912057-94-000263.txt</FileName>
    <GrossFileSize>278174</GrossFileSize>
    <NetFileSize>211739</NetFileSize>
    <NonText_DocumentType_Chars>0</NonText_DocumentType_Chars>
    <HTML_Chars>1866</HTML_Chars>
    <XBRL_Chars>0</XBRL_Chars>
    <XML_Chars>0</XML_Chars>
    <N_Exhibits>2</N_Exhibits>
</FileStats>
<SEC-Header>
0000912057-94-000263.hdr.sgml : 19950608
ACCESSION NUMBER:		0000912057-94-000263
CONFORMED SUBMISSION TYPE:	10-K
PUBLIC DOCUMENT COUNT:		3
CONFORMED PERIOD OF REPORT:	19930831
FILED AS OF DATE:		19931129
DATE AS OF CHANGE:		19931129
SROS:			NONE

FILER:

	COMPANY DATA:	
		COMPANY CONFORMED NAME:			AMERICAN MEDICAL HOLDINGS INC
		CENTRAL INDEX KEY:			0000861439
		STANDARD INDUSTRIAL CLASSIFICATION:	8060
		IRS NUMBER:				133527632
		STATE OF INCORPORATION:			DE
	

### 2.2 Parse SRAF Structure

**What we're doing here:**
SEC filings in SRAF format have a predictable structure with metadata and content separated into sections:

```xml
<Header>
  <FileStats>...</FileStats>   <!-- File size, exhibit counts -->
  <SEC-Header>                 <!-- Company info, CIK, filing date -->
    ACCESSION NUMBER: ...
    COMPANY CONFORMED NAME: ...
    CENTRAL INDEX KEY: ...
    CONFORMED SUBMISSION TYPE: ...
  </SEC-Header>
</Header>
<TEXT>...</TEXT>                <!-- Actual filing content -->
```

**Why extract metadata separately?**
- We need company name, CIK, and filing date to add context to each chunk
- This metadata helps the LLM understand "who" and "when" for each piece of text
- Enables filtering queries like "Show me all Apple 10-Ks from 2020-2024"
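
As a rough illustration of the filtering use case, the sketch below selects chunk records by company, form type, and filing year. The record layout (`company`, `form_type`, `filing_date` as a `YYYYMMDD` string), the helper name `filter_chunks`, and the sample records are all assumptions made for illustration, not a fixed schema.

```python
def filter_chunks(chunks, company=None, form_type=None, year_range=None):
    """Return chunks whose metadata matches every filter that was given (hypothetical helper)."""
    selected = []
    for chunk in chunks:
        meta = chunk["metadata"]
        if company and company.lower() not in meta["company"].lower():
            continue
        if form_type and meta["form_type"] != form_type:
            continue
        if year_range:
            year = int(meta["filing_date"][:4])  # assumes YYYYMMDD strings
            if not year_range[0] <= year <= year_range[1]:
                continue
        selected.append(chunk)
    return selected


# Made-up records, for illustration only
chunks = [
    {"text": "...", "metadata": {"company": "EXAMPLE CORP", "form_type": "10-K", "filing_date": "20211029"}},
    {"text": "...", "metadata": {"company": "EXAMPLE CORP", "form_type": "10-Q", "filing_date": "20210428"}},
    {"text": "...", "metadata": {"company": "EXAMPLE CORP", "form_type": "10-K", "filing_date": "20181105"}},
]

hits = filter_chunks(chunks, company="example", form_type="10-K", year_range=(2020, 2024))
print(len(hits))
```

In a vector database the same idea would typically be expressed as a metadata filter on the query rather than a Python loop.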

**Why remove wrapper tags?**
- The `<Header>` and `<SEC-Header>` sections contain technical filing metadata, not business content
- We want clean text for chunking and embedding
- HTML/XBRL tags are markup, not content - they interfere with semantic understanding
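
One detail worth checking when stripping markup with regular expressions is the order of the passes. The sketch below uses a made-up snippet to show that a generic tag-stripping pass leaves nothing for a later XBRL-specific pass to match, so the XBRL element's inner text survives. The sample string and variable names are illustrative assumptions.

```python
import re

XBRL_ELEMENT = r'<[^>]*xbrl[^>]*>.*?</[^>]*xbrl[^>]*>'
ANY_TAG = r'<[^>]+>'

# Made-up snippet, for illustration only
sample = "<p>Revenue rose.</p><xbrli:context>ctx-data</xbrli:context> Costs fell."

# Order A: generic tags first. The XBRL pattern then finds no tags to match.
tags_first = re.sub(ANY_TAG, ' ', sample)
tags_first = re.sub(XBRL_ELEMENT, '', tags_first, flags=re.DOTALL | re.IGNORECASE)
tags_first = ' '.join(tags_first.split())

# Order B: XBRL elements (with their content) first, then the remaining tags.
xbrl_first = re.sub(XBRL_ELEMENT, '', sample, flags=re.DOTALL | re.IGNORECASE)
xbrl_first = re.sub(ANY_TAG, ' ', xbrl_first)
xbrl_first = ' '.join(xbrl_first.split())

print(tags_first)
print(xbrl_first)
```

Only order B drops `ctx-data`. Note that the XBRL pattern is broad (it matches any tag containing the letters `xbrl`), so it would need testing on real filings before being relied on.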

In [3]:
def extract_sraf_metadata(content):
    """
    Extract metadata from SRAF header
    
    Returns:
        dict: Metadata fields (CIK, company name, form type, date, etc.)
    """
    metadata = {}
    
    # Extract SEC-Header section
    sec_header_match = re.search(r'<SEC-Header>(.*?)</SEC-Header>', content, re.DOTALL | re.IGNORECASE)
    if sec_header_match:
        sec_header = sec_header_match.group(1)
        
        # Field mappings: (output_key, regex_patterns)
        # Try multiple patterns for each field to handle format variations
        field_mappings = {
            'COMPANY_NAME': [
                r'COMPANY CONFORMED NAME:\s*(.+?)(?:\n|$)',
                r'CONFORMED-NAME:\s*(.+?)(?:\n|$)',
                r'CONFORMED NAME:\s*(.+?)(?:\n|$)'
            ],
            'CIK': [
                r'CENTRAL INDEX KEY:\s*(.+?)(?:\n|$)',
                r'CIK:\s*(.+?)(?:\n|$)'
            ],
            'FORM_TYPE': [
                r'FORM TYPE:\s*(.+?)(?:\n|$)',
                r'FORM-TYPE:\s*(.+?)(?:\n|$)',
                r'CONFORMED SUBMISSION TYPE:\s*(.+?)(?:\n|$)'
            ],
            'FILING_DATE': [
                r'FILED AS OF DATE:\s*(.+?)(?:\n|$)',
                r'FILED-AS-OF-DATE:\s*(.+?)(?:\n|$)',
                r'DATE AS OF CHANGE:\s*(.+?)(?:\n|$)'
            ],
            'ACCESSION_NUMBER': [
                r'ACCESSION NUMBER:\s*(.+?)(?:\n|$)',
                r'ACCESSION-NUMBER:\s*(.+?)(?:\n|$)'
            ],
            'PERIOD_OF_REPORT': [
                r'CONFORMED PERIOD OF REPORT:\s*(.+?)(?:\n|$)',
                r'CONFORMED-PERIOD-OF-REPORT:\s*(.+?)(?:\n|$)'
            ]
        }
        
        # Try each pattern until we find a match
        for field, patterns in field_mappings.items():
            for pattern in patterns:
                match = re.search(pattern, sec_header, re.IGNORECASE)
                if match:
                    metadata[field] = match.group(1).strip()
                    break  # Found match, move to next field
    
    return metadata


def extract_clean_text(content):
    """
    Extract clean text content from SRAF-XML-wrapper
    
    Returns:
        str: Clean text with HTML/XML tags removed
    """
    # Remove SRAF wrapper tags
    text = re.sub(r'<Header>.*?</Header>', '', content, flags=re.DOTALL | re.IGNORECASE)
    text = re.sub(r'<SEC-Header>.*?</SEC-Header>', '', text, flags=re.DOTALL | re.IGNORECASE)
    
    # Remove all remaining HTML/XML tags, including XBRL tags.
    # Only the tags are stripped; text between them is kept. Removing XBRL
    # element *content* would need a dedicated pass before this line.
    text = re.sub(r'<[^>]+>', ' ', text)
    
    # Collapse every whitespace run (spaces, tabs, newlines) to a single space.
    # Paragraph breaks are not preserved by this step.
    text = re.sub(r'\s+', ' ', text)
    
    return text.strip()


# Test extraction
metadata = extract_sraf_metadata(raw_content)
clean_text = extract_clean_text(raw_content)

print("[OK] Extracted metadata:")
for key, value in metadata.items():
    print(f"  {key}: {value}")

print(f"\n[OK] Clean text length: {len(clean_text):,} characters")
print(f"\n[Preview] First 500 chars of clean text:\n{clean_text[:500]}")

[OK] Extracted metadata:
  COMPANY_NAME: AMERICAN MEDICAL HOLDINGS INC
  CIK: 0000861439
  FORM_TYPE: 10-K
  FILING_DATE: 19931129
  ACCESSION_NUMBER: 0000912057-94-000263
  PERIOD_OF_REPORT: 19930831

[OK] Clean text length: 209,491 characters

[Preview] First 500 chars of clean text:
Proc-Type: 2001,MIC-CLEAR Originator-Name: keymaster@town.hall.org Originator-Key-Asymmetric: MFkwCgYEVQgBAQICAgADSwAwSAJBALeWW4xDV4i7+b6+UyPn5RtObb1cJ7VkACDq pKb9/DClgTKIm08lCfoilvi9Wl4SODbR1+1waHhiGmeZO8OdgLUCAwEAAQ== MIC-Info: RSA-MD5,RSA, jSme4OE5puXgBpdHHyga1WdDJ0E3trqOOdfp13QPWNizEt4YLMTbUPjitjQi47a9 tBwulFatOU1F7uc/UNiQZQ== 0000912057-94-000263.txt : 19950608 10-K 1 10-K - - - - -------------------------------------------------------------------------------- - - - - ----------------------


## 3. Chunking Strategies

**Why chunking matters:**
Our filings average ~50K tokens (from EDA), but:
- Embedding models have context limits (512-8192 tokens depending on model)
- LLMs have context windows (we need to fit multiple chunks in one query)
- Retrieval precision: smaller chunks = more targeted results

**Our comprehensive test: 12 chunk sizes (200-8000 tokens)**
- **200-500 tokens:** Small chunks aimed at financial fact extraction
- **750-2000 tokens:** Standard RAG range for balanced context
- **3000-8000 tokens:** Large context testing (likely too broad)

**Expected winner:** 500-1000 tokens based on FinGPT research on SEC filings

### 3.1 Token Counting

**What's a token?**
- Not the same as words: common words are often a single token, while long or rare words are split into several sub-word tokens
- Rule of thumb for English: ~0.75 words per token (about 1.3 tokens per word)
- LLMs and embedding models operate on tokens, not characters or words

**Why use tiktoken?**
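- Official OpenAI tokenizer library; its `cl100k_base` encoding is the one used by GPT-3.5/4
- More accurate than "characters/4" estimation
- Essential for staying within model limits (counts are specific to this encoding; other models may tokenize the same text differently)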

In [4]:
# Install tiktoken if needed
try:
    import tiktoken
except ImportError:
    print("[INFO] Installing tiktoken...")
    !pip install tiktoken
    import tiktoken

# Initialize tokenizer (cl100k_base used by GPT-3.5/4)
tokenizer = tiktoken.get_encoding("cl100k_base")

def count_tokens(text):
    """Count tokens in text using tiktoken"""
    return len(tokenizer.encode(text))

# Estimate tokens in sample
token_count = count_tokens(clean_text)
print(f"[OK] Sample filing: {token_count:,} tokens")
print(f"\n[INFO] Estimated chunks for different sizes:")
print(f"  200 tokens: ~{token_count // 200} chunks")
print(f"  500 tokens: ~{token_count // 500} chunks (FinGPT optimal)")
print(f"  1000 tokens: ~{token_count // 1000} chunks (FinGPT optimal)")
print(f"  2000 tokens: ~{token_count // 2000} chunks")
print(f"  5000 tokens: ~{token_count // 5000} chunks")
print(f"  8000 tokens: ~{token_count // 8000} chunks")

[OK] Sample filing: 45,247 tokens

[INFO] Estimated chunks for different sizes:
  200 tokens: ~226 chunks
  500 tokens: ~90 chunks (FinGPT optimal)
  1000 tokens: ~45 chunks (FinGPT optimal)
  2000 tokens: ~22 chunks
  5000 tokens: ~9 chunks
  8000 tokens: ~5 chunks


### 3.2 Fixed-Size Chunking with Overlap

**How this works:**
1. Convert entire text to tokens (list of integers)
2. Slice tokens into chunks of `chunk_size` (e.g., 500 tokens)
3. Move forward by `chunk_size - overlap` to create the next chunk
4. Decode tokens back to text
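
The slicing in steps 2 and 3 can be sketched without a tokenizer by computing only the `(start, end)` token positions of each chunk. The helper name `window_bounds` is an assumption for illustration; it mirrors a loop that advances by `chunk_size - overlap` until the start position passes the end of the token list.

```python
def window_bounds(n_tokens, chunk_size, overlap):
    """Return (start, end) token positions for each chunk (hypothetical helper)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    return [(start, min(start + chunk_size, n_tokens))
            for start in range(0, n_tokens, step)]


# 1,000 tokens, 500-token chunks, 50-token overlap
print(window_bounds(1000, 500, 50))
# [(0, 500), (450, 950), (900, 1000)]
```

Each chunk starts 450 tokens after the previous one, so consecutive chunks share 50 tokens, and the last chunk is shorter than `chunk_size`.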

**Why overlap?**
Without overlap, information at chunk boundaries gets split awkwardly:

```
Chunk 1: "...revenue increased due to strong demand for our new prod-"
Chunk 2: "-uct line, particularly in the Asia-Pacific region..."
```

With overlap (~10% of chunk size):
```
Chunk 1: "...revenue increased due to strong demand for our new product line, particularly in the Asia-Pacific region..."
Chunk 2: "...our new product line, particularly in the Asia-Pacific region, which saw 45% growth..."
```

**Overlap strategy:**
- We use 10% overlap for all chunk sizes
- 200 tokens → 20 overlap, 500 tokens → 50 overlap, 1000 tokens → 100 overlap
- A 10-20% overlap is a common rule of thumb for preserving context without excessive redundancy
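
The effect of overlap on the number of chunks can be worked out directly. As a sketch, assume the chunker advances by `chunk_size - overlap` tokens per step until the start position passes the end of the text, and write $T$ for the total token count, $c$ for the chunk size, and $o$ for the overlap:

$$
N = \left\lceil \frac{T}{c - o} \right\rceil
$$

For the sample filing ($T = 45{,}247$) with $c = 200$ and $o = 20$:

$$
N = \left\lceil \frac{45{,}247}{180} \right\rceil = \lceil 251.4 \rceil = 252
$$

compared with $\lfloor 45{,}247 / 200 \rfloor = 226$ when overlap is ignored. With $o = 0.1\,c$, the chunk count (and therefore embedding storage) grows by a factor of roughly $1 / (1 - 0.1) \approx 1.11$.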

In [5]:
def chunk_by_tokens(text, chunk_size=2000, overlap=200):
    """
    Chunk text by token count with overlap
    
    Args:
        text: Input text
        chunk_size: Target tokens per chunk
        overlap: Token overlap between chunks
    
    Returns:
        list: List of text chunks
    """
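    if not 0 <= overlap < chunk_size:
        # A non-positive step would make the loop below run forever
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    
    tokens = tokenizer.encode(text)
    chunks = []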
    
    start = 0
    while start < len(tokens):
        end = min(start + chunk_size, len(tokens))
        chunk_tokens = tokens[start:end]
        chunk_text = tokenizer.decode(chunk_tokens)
        chunks.append(chunk_text)
        
        # Move start forward by (chunk_size - overlap)
        start += (chunk_size - overlap)
    
    return chunks


# Test ALL 12 chunk sizes with 10% overlap
CHUNK_SIZES = [200, 300, 400, 500, 750, 1000, 1500, 2000, 3000, 4000, 5000, 8000]

print("[INFO] Testing 12 different chunk sizes:\n")

chunking_results = {}

for chunk_size in CHUNK_SIZES:
    overlap = int(chunk_size * 0.1)  # 10% overlap
    chunks = chunk_by_tokens(clean_text, chunk_size=chunk_size, overlap=overlap)
    
    chunking_results[chunk_size] = {
        'num_chunks': len(chunks),
        'overlap': overlap,
        'chunks': chunks
    }
    
    print(f"[OK] {chunk_size:>5} tokens: {len(chunks):>3} chunks (overlap: {overlap} tokens)")

print(f"\n[Summary] Tested chunk sizes from {min(CHUNK_SIZES)} to {max(CHUNK_SIZES)} tokens")
print(f"[Info] Sample filing has {token_count:,} tokens total")

# Preview first chunk from smallest and largest sizes
print(f"\n[Preview] First chunk (200 tokens):\n{chunking_results[200]['chunks'][0][:300]}...\n")
print(f"[Preview] First chunk (8000 tokens):\n{chunking_results[8000]['chunks'][0][:300]}...")

[INFO] Testing 12 different chunk sizes:

[OK]   200 tokens: 252 chunks (overlap: 20 tokens)
[OK]   300 tokens: 168 chunks (overlap: 30 tokens)
[OK]   400 tokens: 126 chunks (overlap: 40 tokens)
[OK]   500 tokens: 101 chunks (overlap: 50 tokens)
[OK]   750 tokens:  68 chunks (overlap: 75 tokens)
[OK]  1000 tokens:  51 chunks (overlap: 100 tokens)
[OK]  1500 tokens:  34 chunks (overlap: 150 tokens)
[OK]  2000 tokens:  26 chunks (overlap: 200 tokens)
[OK]  3000 tokens:  17 chunks (overlap: 300 tokens)
[OK]  4000 tokens:  13 chunks (overlap: 400 tokens)
[OK]  5000 tokens:  11 chunks (overlap: 500 tokens)
[OK]  8000 tokens:   7 chunks (overlap: 800 tokens)

[Summary] Tested chunk sizes from 200 to 8000 tokens
[Info] Sample filing has 45,247 tokens total

[Preview] First chunk (200 tokens):
Proc-Type: 2001,MIC-CLEAR Originator-Name: keymaster@town.hall.org Originator-Key-Asymmetric: MFkwCgYEVQgBAQICAgADSwAwSAJBALeWW4xDV4i7+b6+UyPn5RtObb1cJ7VkACDq pKb9/DClgTKIm08lCfoilvi9Wl4SODbR1+1waHhiGmeZ

### 3.3 Contextual Chunking

**The problem with raw chunks:**
When you retrieve a chunk from the vector database, it's just text with no context:
```
"Revenue increased 15% year-over-year to $2.3B..."
```

Questions the LLM can't answer:
- Which company is this?
- What year?
- Is this a 10-K or 10-Q?

**Contextual chunking solution:**
Add a header to EVERY chunk with document metadata:
```
Document: Apple Inc. (10-K) filed 2024-01-15 [CIK: 0000320193]

Revenue increased 15% year-over-year to $2.3B...
```
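
The example header above shows an ISO-style date, while the `FILED AS OF DATE` field in the SEC header is a compact `YYYYMMDD` string (for example `19931129`). If the ISO form is wanted in the header, a small conversion along these lines could be applied first. The helper name `format_filing_date` is an assumption for illustration.

```python
from datetime import datetime


def format_filing_date(raw):
    """Convert 'YYYYMMDD' to 'YYYY-MM-DD'; return the input unchanged if it does not parse."""
    try:
        return datetime.strptime(raw, "%Y%m%d").date().isoformat()
    except (ValueError, TypeError):
        return raw


print(format_filing_date("19931129"))      # 1993-11-29
print(format_filing_date("Unknown Date"))  # Unknown Date
```

Falling back to the raw value keeps placeholder strings such as `Unknown Date` intact instead of raising.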

**Why this matters for RAG:**
1. **Better retrieval:** Embedding models encode the metadata too, improving matching
2. **LLM context:** The LLM knows exactly what document it's reading
3. **Multi-document queries:** "Compare Apple vs Microsoft revenue" - LLM can distinguish chunks
4. **Citation:** System can cite exactly which filing a fact came from

**Trade-off:**
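- Adds an estimated ~50-100 tokens per chunk (context header) on top of the nominal chunk size, so the relative overhead is largest for small chunks
- Expected to improve retrieval quality and answer accuracy; this still needs to be confirmed by the retrieval evaluation
- Most valuable for complex queries across multiple filings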
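
To put the overhead in perspective, the sketch below computes the share of each final chunk taken up by the header, and the token budget left for chunk text under a model limit. The header size of 50 tokens is an assumption (the lower end of the estimate above), and both helper names are hypothetical; the real header length should be measured with the tokenizer.

```python
def header_overhead(chunk_size, header_tokens):
    """Fraction of the final chunk (header + body) taken up by the header."""
    return header_tokens / (chunk_size + header_tokens)


def body_budget(model_limit, header_tokens):
    """Tokens left for chunk text once the header is reserved within a model limit."""
    if header_tokens >= model_limit:
        raise ValueError("header does not fit within the model limit")
    return model_limit - header_tokens


HEADER_TOKENS = 50  # assumption, not a measured value

for size in (200, 500, 1000):
    print(f"{size:>5} tokens: header is {header_overhead(size, HEADER_TOKENS):.1%} of the chunk")

print(body_budget(512, HEADER_TOKENS))  # 462
```

Because the header is prepended after chunking, a nominal 500-token chunk ends up longer than 500 tokens, which matters for embedding models with a 512-token limit.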

In [6]:
def create_contextual_chunks(chunks, metadata):
    """
    Add document context to each chunk
    
    Args:
        chunks: List of text chunks
        metadata: Document metadata dict
    
    Returns:
        list: List of dicts with chunk text and metadata
    """
    contextual_chunks = []
    
    # Create context header (use new field names)
    company = metadata.get('COMPANY_NAME', 'Unknown Company')
    form_type = metadata.get('FORM_TYPE', 'Unknown Form')
    filing_date = metadata.get('FILING_DATE', 'Unknown Date')
    cik = metadata.get('CIK', 'Unknown CIK')
    
    context_header = f"Document: {company} ({form_type}) filed {filing_date} [CIK: {cik}]\n\n"
    
    for i, chunk in enumerate(chunks):
        contextual_chunks.append({
            'chunk_id': i,
            'text': context_header + chunk,
            'metadata': {
                'company': company,
                'form_type': form_type,
                'filing_date': filing_date,
                'cik': cik,
                'chunk_index': i,
                'total_chunks': len(chunks)
            }
        })
    
    return contextual_chunks


# Create contextual chunks for all 12 chunk sizes
print("[INFO] Creating contextual chunks for all chunk sizes:\n")

contextual_results = {}

for chunk_size in CHUNK_SIZES:
    chunks = chunking_results[chunk_size]['chunks']
    contextual_chunks = create_contextual_chunks(chunks, metadata)
    contextual_results[chunk_size] = contextual_chunks
    print(f"[OK] {chunk_size:>5} tokens: {len(contextual_chunks)} contextual chunks created")

# Preview contextual chunk from 500 tokens (likely optimal per FinGPT)
print(f"\n[Preview] Contextual chunk example (500 tokens):")
print(f"{contextual_results[500][0]['text'][:600]}...")
print(f"\n[Metadata]:\n{json.dumps(contextual_results[500][0]['metadata'], indent=2)}")

[INFO] Creating contextual chunks for all chunk sizes:

[OK]   200 tokens: 252 contextual chunks created
[OK]   300 tokens: 168 contextual chunks created
[OK]   400 tokens: 126 contextual chunks created
[OK]   500 tokens: 101 contextual chunks created
[OK]   750 tokens: 68 contextual chunks created
[OK]  1000 tokens: 51 contextual chunks created
[OK]  1500 tokens: 34 contextual chunks created
[OK]  2000 tokens: 26 contextual chunks created
[OK]  3000 tokens: 17 contextual chunks created
[OK]  4000 tokens: 13 contextual chunks created
[OK]  5000 tokens: 11 contextual chunks created
[OK]  8000 tokens: 7 contextual chunks created

[Preview] Contextual chunk example (500 tokens):
Document: AMERICAN MEDICAL HOLDINGS INC (10-K) filed 19931129 [CIK: 0000861439]

Proc-Type: 2001,MIC-CLEAR Originator-Name: keymaster@town.hall.org Originator-Key-Asymmetric: MFkwCgYEVQgBAQICAgADSwAwSAJBALeWW4xDV4i7+b6+UyPn5RtObb1cJ7VkACDq pKb9/DClgTKIm08lCfoilvi9Wl4SODbR1+1waHhiGmeZO8OdgLUCAwEAAQ== MIC-Info: RSA-

## 4. Batch Processing All Samples

In [7]:
def process_filing(file_path, chunk_size=2000, overlap=None):
    """
    Complete processing pipeline for a single filing
    
    Args:
        file_path: Path to filing
        chunk_size: Target tokens per chunk
        overlap: Token overlap (defaults to 10% of chunk_size)
    
    Returns:
        dict: Processed filing with chunks and metadata
    """
    if overlap is None:
        overlap = int(chunk_size * 0.1)  # Default 10% overlap
    
    # Load file
    with open(file_path, 'r', encoding='utf-8') as f:
        raw_content = f.read()
    
    # Extract metadata and clean text
    metadata = extract_sraf_metadata(raw_content)
    clean_text = extract_clean_text(raw_content)
    
    # Create chunks
    chunks = chunk_by_tokens(clean_text, chunk_size, overlap)
    
    # Add context
    contextual_chunks = create_contextual_chunks(chunks, metadata)
    
    return {
        'file_name': file_path.name,
        'metadata': metadata,
        'total_tokens': count_tokens(clean_text),
        'chunk_size': chunk_size,
        'overlap': overlap,
        'num_chunks': len(chunks),
        'chunks': contextual_chunks
    }


# Process all samples with ALL 12 chunk sizes
sample_paths = sorted(SAMPLES_DIR.glob('*.txt'))  # glob once and reuse below
sample_count = len(sample_paths)

print("[INFO] Processing all sample files with 12 different chunk sizes...")
print(f"[INFO] Total files: {sample_count}\n")

all_results = {chunk_size: [] for chunk_size in CHUNK_SIZES}

for chunk_size in CHUNK_SIZES:
    print(f"Processing {chunk_size:>5} token chunks... ", end='', flush=True)
    
    for sample_path in sample_paths:
        try:
            result = process_filing(sample_path, chunk_size=chunk_size)
            all_results[chunk_size].append(result)
        except Exception as e:
            # Report the failure and continue with the remaining files
            print(f"\n[FAIL] {sample_path.name}: {e}")
    
    print(f"Done! ({len(all_results[chunk_size])}/{sample_count} files)")

print(f"\n{'='*80}")
print("[COMPLETE] All 12 chunk sizes processed successfully!")
print(f"{'='*80}")

[INFO] Processing all sample files with 12 different chunk sizes...
[INFO] Total files: 1375

Processing   200 token chunks... Done! (1375/1375 files)
Processing   300 token chunks... Done! (1375/1375 files)
Processing   400 token chunks... Done! (1375/1375 files)
Processing   500 token chunks... Done! (1375/1375 files)
Processing   750 token chunks... Done! (1375/1375 files)
Processing  1000 token chunks... Done! (1375/1375 files)
Processing  1500 token chunks... Done! (1375/1375 files)
Processing  2000 token chunks... Done! (1375/1375 files)
Processing  3000 token chunks... Done! (1375/1375 files)
Processing  4000 token chunks... Done! (1375/1375 files)
Processing  5000 token chunks... Done! (1375/1375 files)
Processing  8000 token chunks... Done! (1375/1375 files)

[COMPLETE] All 12 chunk sizes processed successfully!


## 5. Summary Statistics
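
The storage column in the comparison below is a simple product of chunk count, embedding dimension, and bytes per value. The sketch below restates that arithmetic so it can be re-run for other embedding sizes. The helper name `embedding_storage_mb` is hypothetical, and the dimensions 768 and 1024 are shown only as alternative assumptions next to the 1536 used in the notebook.

```python
def embedding_storage_mb(num_chunks, dim, bytes_per_value=4):
    """Raw float storage in MiB for num_chunks vectors of size dim (no index overhead)."""
    return num_chunks * dim * bytes_per_value / (1024 * 1024)


# 76,953 is the total chunk count reported for 1000-token chunks
for dim in (768, 1024, 1536):
    print(f"{dim:>5} dims: {embedding_storage_mb(76_953, dim):.2f} MB")
```

The estimate scales linearly with both chunk count and dimension, and excludes any vector index or metadata overhead.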

In [8]:
import pandas as pd

# Create comprehensive comparison across all 12 chunk sizes
print("[INFO] Generating comprehensive statistics for all chunk sizes...\n")

comparison_stats = []

for chunk_size in CHUNK_SIZES:
    filings = all_results[chunk_size]
    
    total_chunks = sum(f['num_chunks'] for f in filings)
    avg_chunks = total_chunks / len(filings) if filings else 0
    avg_tokens = sum(f['total_tokens'] for f in filings) / len(filings) if filings else 0
    
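    # Estimate storage for float32 embeddings (4 bytes/float). The 1536 dimensions
    # are an assumption; the models listed for later testing differ (e.g. 768 for
    # nomic-embed-text-v1.5, 1024 for BGE-large), which scales this linearly.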
    embedding_size_mb = (total_chunks * 1536 * 4) / (1024 * 1024)
    
    comparison_stats.append({
        'chunk_size': chunk_size,
        'overlap': int(chunk_size * 0.1),
        'total_chunks': total_chunks,
        'avg_chunks_per_filing': round(avg_chunks, 1),
        'avg_tokens_per_filing': int(avg_tokens),
        'estimated_storage_mb': round(embedding_size_mb, 2)
    })

df_comparison = pd.DataFrame(comparison_stats)

print("="*80)
print("CHUNK SIZE COMPARISON - ALL 12 SIZES")
print("="*80)
print(df_comparison.to_string(index=False))

print(f"\n{'='*80}")
print("KEY INSIGHTS:")
print(f"{'='*80}")
print(f"Total files processed: {len(all_results[CHUNK_SIZES[0]])}")
print(f"Chunk sizes tested: {len(CHUNK_SIZES)}")
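# Select the column first: a row taken with .iloc mixes int and float columns and is upcast to float
print(f"\nSmallest chunks (200 tokens): {int(df_comparison['total_chunks'].iloc[0]):,} total chunks")
print(f"Largest chunks (8000 tokens): {int(df_comparison['total_chunks'].iloc[-1]):,} total chunks")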
print(f"\nStorage range: {df_comparison['estimated_storage_mb'].min():.1f} MB - {df_comparison['estimated_storage_mb'].max():.1f} MB")
print(f"\nExpected optimal (FinGPT research): 500-1000 token chunks")

# Highlight 500 and 1000 token stats
print(f"\n{'='*80}")
print("FINGPT RECOMMENDED RANGE:")
print(f"{'='*80}")
for size in [500, 1000]:
    row = df_comparison[df_comparison['chunk_size'] == size].iloc[0]
    print(f"\n{size} tokens:")
    print(f"  Total chunks: {int(row['total_chunks']):,}")  # row values are upcast to float
    print(f"  Avg chunks/filing: {row['avg_chunks_per_filing']}")
    print(f"  Storage: {row['estimated_storage_mb']:.2f} MB")

[INFO] Generating comprehensive statistics for all chunk sizes...

CHUNK SIZE COMPARISON - ALL 12 SIZES
 chunk_size  overlap  total_chunks  avg_chunks_per_filing  avg_tokens_per_filing  estimated_storage_mb
        200       20        381933                  277.8                  49910               2237.89
        300       30        254864                  185.4                  49910               1493.34
        400       40        191315                  139.1                  49910               1120.99
        500       50        153207                  111.4                  49910                897.70
        750       75        102338                   74.4                  49910                599.64
       1000      100         76953                   56.0                  49910                450.90
       1500      150         51515                   37.5                  49910                301.85
       2000      200         38820                   28.2               

## 6. Export Results

Save processed chunks for embedding generation testing

In [9]:
# Create output directory
output_dir = project_root / 'notebooks' / 'prototyping' / 'output'
output_dir.mkdir(parents=True, exist_ok=True)  # also create missing parent folders

print("[INFO] Saving processed chunks for all 12 chunk sizes...\n")

# Save each chunk size separately for easy comparison
saved_files = []

for chunk_size in CHUNK_SIZES:
    output_path = output_dir / f'processed_samples_{chunk_size}tok.json'
    
    with open(output_path, 'w', encoding='utf-8') as f:
        json.dump(all_results[chunk_size], f, indent=2)
    
    file_size_mb = output_path.stat().st_size / (1024*1024)
    saved_files.append({
        'chunk_size': chunk_size,
        'filename': output_path.name,
        'size_mb': round(file_size_mb, 2)
    })
    
    print(f"[OK] {chunk_size:>5} tokens: {output_path.name} ({file_size_mb:.2f} MB)")

# Save comparison summary
summary_path = output_dir / 'chunk_size_comparison.csv'
df_comparison.to_csv(summary_path, index=False)
print(f"\n[OK] Saved comparison summary: {summary_path.name}")

# Clean up old files
old_file = output_dir / 'processed_samples_2k.json'
if old_file.exists():
    old_file.unlink()
    print(f"[INFO] Removed old file: processed_samples_2k.json")

print(f"\n{'='*80}")
print("EXPORT COMPLETE")
print(f"{'='*80}")
print(f"Total files saved: {len(saved_files)} JSON files + 1 CSV summary")
print(f"Output directory: {output_dir}")
print(f"\nFiles created:")
print(f"  - 12 JSON files: processed_samples_200tok.json through processed_samples_8000tok.json")
print(f"  - 1 CSV summary: chunk_size_comparison.csv")
print(f"\nNext steps:")
print(f"1. Review chunk coherence for different sizes")
print(f"2. Create test queries for retrieval evaluation")
print(f"3. Test embedding generation (notebook: 03_embedding_tests.ipynb)")
print(f"4. Determine optimal chunk size empirically")

[INFO] Saving processed chunks for all 12 chunk sizes...

[OK]   200 tokens: processed_samples_200tok.json (482.48 MB)
[OK]   300 tokens: processed_samples_300tok.json (435.41 MB)
[OK]   400 tokens: processed_samples_400tok.json (411.89 MB)
[OK]   500 tokens: processed_samples_500tok.json (397.77 MB)
[OK]   750 tokens: processed_samples_750tok.json (378.93 MB)
[OK]  1000 tokens: processed_samples_1000tok.json (369.44 MB)
[OK]  1500 tokens: processed_samples_1500tok.json (359.97 MB)
[OK]  2000 tokens: processed_samples_2000tok.json (355.11 MB)
[OK]  3000 tokens: processed_samples_3000tok.json (350.13 MB)
[OK]  4000 tokens: processed_samples_4000tok.json (347.46 MB)
[OK]  5000 tokens: processed_samples_5000tok.json (345.65 MB)
[OK]  8000 tokens: processed_samples_8000tok.json (342.44 MB)

[OK] Saved comparison summary: chunk_size_comparison.csv

EXPORT COMPLETE
Total files saved: 12 JSON files + 1 CSV summary
Output directory: C:\Users\kabec\Documents\edgar_anomaly_detection\notebooks\pr

## 7. Next Steps

### Completed ✅
- Extracted clean text from SRAF-XML-wrapper format
- Tested **12 different chunk sizes** (200, 300, 400, 500, 750, 1000, 1500, 2000, 3000, 4000, 5000, 8000 tokens)
- Created contextual chunks with document metadata
- Processed all 1,375 sample files with all chunk sizes
- Generated comprehensive comparison statistics
- Exported 12 JSON files (one per chunk size) + CSV summary

### Next Steps 🔄

**1. Chunk Coherence Review (Manual - ~30 minutes):**
   - Inspect sample chunks from each of the 12 sizes
   - Check: Are 200-token chunks too fragmented?
   - Check: Are 8000-token chunks too broad?
   - Identify which sizes maintain semantic coherence
   - Focus on FinGPT range: 500-1000 tokens

**2. Retrieval Quality Evaluation (`02_chunking_strategies.ipynb`):**
   - Define 20 test questions with known answers
   - Examples:
     - "What was revenue in Q3?" (fact extraction → favors small chunks)
     - "Explain cyber risk factors" (narrative → favors larger chunks)
   - Run retrieval tests on **all 12 chunk sizes**
   - Measure: hit rate (was a chunk containing the answer retrieved?), answer accuracy (did the LLM answer correctly?), rank (position of the best chunk)
   - Compare performance across all sizes
   - Expected winner: 500-1000 tokens (FinGPT research)
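
A minimal sketch of how the retrieval measurements could be scored, assuming each test question has a set of chunk ids known to contain the answer. The function names, chunk ids, and the three example runs are made up for illustration.

```python
def hit_rate_at_k(ranked_ids, relevant_ids, k):
    """1.0 if any relevant chunk id appears in the top k results, else 0.0."""
    return float(any(cid in relevant_ids for cid in ranked_ids[:k]))


def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / position of the first relevant chunk, or 0.0 if none was retrieved."""
    for position, cid in enumerate(ranked_ids, start=1):
        if cid in relevant_ids:
            return 1.0 / position
    return 0.0


# (retrieved chunk ids in rank order, ids that contain the answer) - made-up data
runs = [
    (["c7", "c2", "c9"], {"c2"}),
    (["c1", "c4", "c5"], {"c8"}),
    (["c3", "c6", "c0"], {"c3"}),
]

hit_rate = sum(hit_rate_at_k(ranked, rel, k=3) for ranked, rel in runs) / len(runs)
mrr = sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)
print(f"hit rate@3 = {hit_rate:.2f}, MRR = {mrr:.2f}")
```

Computing both per chunk size gives one comparable row per size; answer accuracy would still need a separate check of the LLM output.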

**3. Embedding Generation Tests (`03_embedding_tests.ipynb`):**
   - Test embedding models: nomic-embed-text-v1.5 (8192-token limit), BGE-large (512-token limit, so larger chunks would be truncated)
   - Generate embeddings for chunks from top 3-5 chunk sizes (after narrowing down)
   - Compare embedding quality and retrieval performance

**4. Determine Optimal Chunk Size (Empirical):**
   - Based on FinGPT research: expect 500-1000 tokens to win for fact-based queries
   - Validate with our test queries and data
   - May find different sizes optimal for different query types:
     - Fact extraction: Likely 200-500 tokens
     - Narrative understanding: Likely 1000-2000 tokens
     - Complex reasoning: Possibly 2000-4000 tokens

**5. Migrate to Production (`src/prod/data/text_processor.py`):**
   - Copy winning chunking strategy (or top 2-3 if query-type dependent)
   - Add error handling and logging
   - Optimize for batch processing
   - Create unit tests
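
One possible shape for the error handling and logging step is sketched below. It wraps a per-file processing function so that a single bad file is logged and recorded without stopping the batch. The names `process_many` and `fake_process`, and the logger name, are assumptions for illustration rather than the planned production API.

```python
import logging

logger = logging.getLogger("text_processor")  # hypothetical logger name


def process_many(paths, process_fn):
    """Apply process_fn to each path; collect results and failures separately."""
    results, failures = [], []
    for path in paths:
        try:
            results.append(process_fn(path))
        except Exception as exc:  # broad on purpose: one bad file should not stop the batch
            logger.warning("Failed to process %s: %s", path, exc)
            failures.append((path, str(exc)))
    return results, failures


def fake_process(path):
    """Stand-in for the real per-file pipeline, used only to exercise the wrapper."""
    if path.endswith("bad.txt"):
        raise ValueError("unreadable filing")
    return {"file_name": path}


results, failures = process_many(["a.txt", "bad.txt", "c.txt"], fake_process)
print(len(results), len(failures))
```

Returning the failures alongside the results also gives a unit test something concrete to assert on.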

**6. Scale to Full Dataset:**
   - Process the full filing corpus (beyond the 1,375-file sample) with the optimal chunk size
   - Generate embeddings
   - Store in vector database (Qdrant/Weaviate)

---

### Reference Files:

**Processed chunks (12 JSON files):**
- `output/processed_samples_200tok.json` through `output/processed_samples_8000tok.json`
- Each file contains all 1,375 filings chunked at that size

**Comparison summary:**
- `output/chunk_size_comparison.csv` - Statistics for all 12 chunk sizes
  - Total chunks per size
  - Average chunks per filing
  - Storage requirements
  - Overlap strategy

**Expected optimal range (FinGPT):** 500-1000 tokens  
**Validation method:** Retrieval quality metrics in next notebook