# From Raw SEC Filings to LLM-Ready Markdown

**How messy SEC HTML becomes clean, structured Markdown for retrieval and reasoning.**

SEC filings are built for browsers, not models. Standard parsers flatten their complex tables into garbled text, XBRL tags pollute the content, and documents of 200+ pages expose no explicit structure in their markup. Let's see how `sec2md` transforms that chaos into clean, structured Markdown that an AI can reason over.

## Setup

SEC EDGAR expects automated requests to identify themselves with a User-Agent string. `edgartools` sets that for us through `set_identity`.

In [None]:
# Install if needed
# !pip install edgartools sec2md

from edgar import Company, set_identity
import sec2md
from IPython.display import Markdown, HTML, display

# Set your identity for SEC
set_identity("Your Name <you@example.com>")
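
For readers who fetch filings without `edgartools`, the identity requirement comes down to sending a descriptive `User-Agent` header. The sketch below only builds a request object and never sends it; the helper name `build_sec_request` and the example URL are illustrative assumptions, not part of `edgartools` or `sec2md`.

```python
from urllib.request import Request


def build_sec_request(url: str, identity: str) -> Request:
    """Create (but do not send) a request that identifies the caller."""
    return Request(url, headers={"User-Agent": identity})


# Illustrative URL and identity; replace both with your own values.
req = build_sec_request(
    "https://data.sec.gov/submissions/CIK0000320193.json",
    "Your Name you@example.com",
)

# urllib normalizes header names, so the lookup key is "User-agent".
print(req.get_header("User-agent"))
```

Passing `req` to `urllib.request.urlopen` would perform the actual download.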

## 1. The Problem: Raw SEC HTML is Unusable

Let's pull Apple's latest 10-K and see what we're dealing with.

In [None]:
company = Company("AAPL")
filing = company.get_filings(form="10-K").latest()

print(f"Filing: {filing.form} - {filing.filing_date}")
print(f"Period: {filing.report_date}")

### What Does the Raw HTML Look Like?

This is what standard parsers have to deal with — complex tables with nested structures, absolutely positioned `<div>` elements, inline styles, and XBRL tags everywhere.

In [None]:
html = filing.html()

# Show raw HTML
print("Raw HTML (chars 5000-6000):")
print("=" * 80)
print(html[5000:6000])  # Skip header boilerplate
print("=" * 80)

### The Problem: Tables Get Completely Garbled

Standard HTML-to-text parsers can't handle the complexity of SEC tables. They don't just "break" on edge cases — **all tables get garbled**. Let's see what a simple library would produce:

In [None]:
# What a naive parser produces
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'html.parser')
naive_text = soup.get_text(separator=' ', strip=True)

print("Naive HTML-to-text output (first 1000 chars):")
print("=" * 80)
print(naive_text[:1000])
print("=" * 80)
print("\nNotice: no structure, no table boundaries, hidden XBRL values mixed in, whitespace chaos.")
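
To see why flattening hurts, it helps to shrink the problem to a single table. The sketch below uses only the standard library and a made-up, simplified SEC-style table (the figures are invented) in which currency symbols sit in their own cells, a layout common in filings. Extracting text alone discards every row and column boundary.

```python
from html.parser import HTMLParser

# Hypothetical table with invented figures; "$" lives in its own <td>.
SAMPLE = """
<table>
  <tr><td></td><td colspan="2">2024</td><td colspan="2">2023</td></tr>
  <tr><td>Revenue</td><td>$</td><td>1,000</td><td>$</td><td>900</td></tr>
  <tr><td>Net income</td><td>$</td><td>200</td><td>$</td><td>150</td></tr>
</table>
"""


class TextOnly(HTMLParser):
    """Keep text nodes and ignore all markup, like a naive extractor."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.parts.append(text)


parser = TextOnly()
parser.feed(SAMPLE)
flat = " ".join(parser.parts)
print(flat)
# -> 2024 2023 Revenue $ 1,000 $ 900 Net income $ 200 $ 150
```

Nothing in the flattened string says which number belongs to which year, which is exactly the information a model needs.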

## 2. The Solution: Clean, Structured Markdown

`sec2md` rebuilds the document structure from scratch. Tables become pipe-formatted Markdown, headers preserve hierarchy, and XBRL tags vanish. This isn't cosmetic — **it changes how a model can understand the document**.

In [None]:
md = sec2md.convert_to_markdown(html)

print(f"Total length: {len(md):,} characters")
print(f"\nFirst 1000 chars (raw):")
print("=" * 80)
print(md[:1000])
print("=" * 80)
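
What "pipe-formatted Markdown" means for a table can be shown on a toy input. The following is a minimal standard-library sketch, not `sec2md`'s implementation: the class `TableRows` and the helpers `merge_currency` and `to_pipe_table` are hypothetical names, the figures are invented, and real filings need far more care (row spans, footnotes, nested tables).

```python
from html.parser import HTMLParser

# Hypothetical table with invented figures; "$" lives in its own <td>.
SAMPLE = """
<table>
  <tr><td></td><td colspan="2">2024</td><td colspan="2">2023</td></tr>
  <tr><td>Revenue</td><td>$</td><td>1,000</td><td>$</td><td>900</td></tr>
  <tr><td>Net income</td><td>$</td><td>200</td><td>$</td><td>150</td></tr>
</table>
"""


class TableRows(HTMLParser):
    """Collect the text of each <td>/<th>, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # text fragments while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self.rows[-1].append(" ".join(p for p in self._cell if p))
            self._cell = None


def merge_currency(row):
    """Glue a lone '$' cell onto the value that follows it."""
    merged, i = [], 0
    while i < len(row):
        if row[i] == "$" and i + 1 < len(row):
            merged.append("$" + row[i + 1])
            i += 2
        else:
            merged.append(row[i])
            i += 1
    return merged


def to_pipe_table(html):
    parser = TableRows()
    parser.feed(html)
    rows = [merge_currency(r) for r in parser.rows]
    width = max(len(r) for r in rows)
    rows = [r + [""] * (width - len(r)) for r in rows]  # pad short rows
    header, *body = rows
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


table_md = to_pipe_table(SAMPLE)
print(table_md)
```

The first data row comes out as `| Revenue | $1,000 | $900 |`, so each value stays attached to its year column.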

### See the Difference: Rendered Markdown

Now let's see what this looks like when rendered (compare to the raw HTML above):

In [None]:
display(Markdown(md[:2000]))

**Clean headers, preserved structure, readable tables.** This is what makes LLM reasoning possible.

## 3. Section Awareness: Your Agent Doesn't Need to Search Everything

A 10-K is modular: *Item 1. Business*, *Item 1A. Risk Factors*, *Item 7. MD&A*. Section awareness means your agent doesn't have to search the entire 200-page document — **it can target exactly where a fact belongs**.

In [None]:
# Convert with page tracking
pages = sec2md.convert_to_markdown(html, return_pages=True)

print(f"Total pages: {len(pages)}")
print(f"\nFirst page preview:")
print(pages[0].content[:300])

### Extract All Sections

In [None]:
sections = sec2md.extract_sections(pages, filing_type="10-K")

print(f"Total sections found: {len(sections)}\n")
print("Section breakdown:")
for section in sections:
    print(f"  {section.item}: {section.item_title} (Pages {section.page_range[0]}-{section.page_range[1]})")
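
The idea behind section extraction can be sketched with a regular expression over per-page Markdown. This is an illustration under simplifying assumptions, not how `sec2md` works internally: `ITEM_RE`, `find_sections`, and the toy pages are all hypothetical, and a real 10-K also needs handling for the table of contents and cross-references that begin a line.

```python
import re

# Matches line-initial headings such as "## Item 1A." or "ITEM 7:".
ITEM_RE = re.compile(
    r"^\s*#*\s*\**\s*item\s+(\d{1,2}[A-C]?)\s*[.:]",
    re.IGNORECASE | re.MULTILINE,
)


def find_sections(pages):
    """Return (item, first_page, last_page) tuples with 1-based pages."""
    starts = []
    for page_no, text in enumerate(pages, start=1):
        for m in ITEM_RE.finditer(text):
            starts.append((m.group(1).upper(), page_no))

    sections = []
    for idx, (item, start) in enumerate(starts):
        if idx + 1 < len(starts):
            # Simplification: a section ends on the page before the next heading.
            end = max(start, starts[idx + 1][1] - 1)
        else:
            end = len(pages)
        sections.append((f"ITEM {item}", start, end))
    return sections


toy_pages = [
    "## Item 1. Business\nOverview text.",
    "More business text.",
    "## Item 1A. Risk Factors\nRisk text.",
    "More risk text.",
    "## Item 7. Management's Discussion\nDiscussion text.",
]

for item, first, last in find_sections(toy_pages):
    print(f"{item}: pages {first}-{last}")
```

Once page ranges are known, a retriever can restrict its search to the pages of the section a question belongs to.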

### Get a Specific Section: Risk Factors

Instead of retrieving from 59 pages, we can target the exact section we need.

In [None]:
risk = sec2md.get_section(sections, sec2md.Item10K.RISK_FACTORS)

if risk:
    print(f"Section: {risk.item} - {risk.item_title}")
    print(f"Page range: {risk.page_range}")
    print(f"Total pages in section: {len(risk.pages)}")
    print(f"Total chars: {len(risk.markdown()):,}")
    print(f"\nFirst 500 chars (raw):")
    print(risk.markdown()[:500])
else:
    print("Risk Factors section not found")

### Rendered Section

In [None]:
if risk:
    display(Markdown(risk.markdown()[:1500]))

## 4. Chunking for Retrieval: Semantic Atoms with Context

Each chunk keeps page numbers, section titles, and headers. When embedded, these become **semantic "atoms" for retrieval** — a model can cite specific pages or sections instead of hallucinating context.

### Chunk the Risk Factors Section

In [None]:
risk_chunks = []  # stays empty if the section was not found
if risk:
    # Build header for better embeddings
    header = f"""# Apple Inc. (AAPL - NASDAQ)
Sector: Technology | Industry: Consumer Electronics
Form 10-K | Period: {filing.report_date} | Filed: {filing.filing_date}

## Risk Factors
"""
    
    # Chunk with header
    risk_chunks = sec2md.chunk_section(risk, chunk_size=512, header=header)
    
    print(f"Risk Factors chunks: {len(risk_chunks)}")
    print(f"\nFirst 3 chunks:")
    for i, chunk in enumerate(risk_chunks[:3]):
        print(f"\n{'='*80}")
        print(f"Chunk {i+1} | Page {chunk.page} | {chunk.num_tokens} tokens | Has table: {chunk.has_table}")
        print(f"{'='*80}")
        print(chunk.content[:300] + "...")
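
How a chunk can keep its page number is easiest to see in a stripped-down chunker. The sketch below is an assumption-laden stand-in, not `sec2md`'s algorithm: `ToyChunk` and `chunk_pages_sketch` are hypothetical names, whitespace-separated words stand in for tokens, and chunks never cross a page boundary so that every chunk maps to exactly one page.

```python
from dataclasses import dataclass


@dataclass
class ToyChunk:
    page: int
    content: str
    num_tokens: int  # word count used as a rough token proxy


def chunk_pages_sketch(pages, chunk_size=50):
    """Greedily pack paragraphs into chunks; pages is [(page_no, text), ...]."""
    chunks = []
    for page_no, text in pages:
        buf, count = [], 0
        for para in text.split("\n\n"):
            n = len(para.split())
            # Flush the buffer before it would exceed the size budget.
            if buf and count + n > chunk_size:
                chunks.append(ToyChunk(page_no, "\n\n".join(buf), count))
                buf, count = [], 0
            buf.append(para)
            count += n
        if buf:
            chunks.append(ToyChunk(page_no, "\n\n".join(buf), count))
    return chunks


toy_pages = [(1, "a b c\n\nd e f\n\ng h"), (2, "i j")]
toy_chunks = chunk_pages_sketch(toy_pages, chunk_size=6)
for c in toy_chunks:
    print(c.page, c.num_tokens, repr(c.content))
```

Splitting on paragraph boundaries rather than at a fixed character offset keeps sentences intact, at the cost of chunks that vary in size.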

### Embedding Text (with Header)

The `embedding_text` field includes the header for better semantic search, but the `content` field keeps only the actual filing text.

In [None]:
if risk_chunks:
    print("First chunk's embedding_text (includes header):")
    print("=" * 80)
    print(risk_chunks[0].embedding_text[:600])
    print("...")
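
The split between the two fields can be modelled in a few lines. In this sketch the class `HeaderedChunk` and the way the header is joined to the content are assumptions made for illustration; `sec2md` may assemble `embedding_text` differently.

```python
from dataclasses import dataclass


@dataclass
class HeaderedChunk:
    content: str      # verbatim filing text, safe to quote or cite
    header: str = ""  # document-level context added only for embedding

    @property
    def embedding_text(self) -> str:
        """Text to send to the embedding model."""
        if self.header:
            return f"{self.header}\n\n{self.content}"
        return self.content


chunk = HeaderedChunk(content="Risk text.", header="# Apple Inc. | Risk Factors")
print(chunk.embedding_text)
```

Keeping `content` free of the header means the text shown to a user or quoted in an answer is always exactly what the filing says.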

## 5. Putting It Together: Simple Keyword Search

Instead of brute-forcing 200 pages, retrieval now happens over meaningful segments with built-in context. Let's mock a simple keyword search:

In [None]:
query = "supply chain"
matches = [c for c in risk_chunks if query.lower() in c.content.lower()]

print(f"Found {len(matches)} chunks mentioning '{query}'\n")

for i, m in enumerate(matches[:2]):
    print(f"\n{'='*80}")
    print(f"Match {i+1} | Page {m.page}")
    print(f"{'='*80}")
    display(Markdown(m.content[:400] + "..."))

## 6. Bonus: Check Table Preservation

Let's verify that complex tables are preserved in Markdown format.

In [None]:
# Chunk all pages to find tables
all_chunks = sec2md.chunk_pages(pages, chunk_size=512, chunk_overlap=128)

table_chunks = [c for c in all_chunks if c.has_table]

print(f"Chunks with tables: {len(table_chunks)} / {len(all_chunks)}")

if table_chunks:
    print(f"\nFirst table chunk (Page {table_chunks[0].page}):")
    print("=" * 80)
    display(Markdown(table_chunks[0].content[:600]))
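
A flag like `has_table` can be approximated by looking for the delimiter row that every pipe table carries under its header. The helper below is a hypothetical sketch, not the check `sec2md` performs, and it deliberately requires at least three hyphens per column to reduce false positives.

```python
import re

# Delimiter row of a pipe table, e.g. "| --- | :---: |".
DELIM_ROW = re.compile(r"^\s*\|?\s*:?-{3,}:?\s*(\|\s*:?-{3,}:?\s*)*\|?\s*$")


def has_pipe_table(markdown: str) -> bool:
    """True if a header line is directly followed by a delimiter row."""
    lines = markdown.splitlines()
    for prev, line in zip(lines, lines[1:]):
        if "|" in prev and "|" in line and DELIM_ROW.match(line):
            return True
    return False


table_text = "| Item | 2024 |\n| --- | --- |\n| Revenue | $1,000 |"
plain_text = "Just text\n\n---\n\nmore text"

print(has_pipe_table(table_text), has_pipe_table(plain_text))
```

Requiring a pipe on both lines keeps a plain horizontal rule (`---`) from being mistaken for a table.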

## Conclusion

We started with unreadable HTML where **all tables get garbled by standard parsers**, and ended with:

- ✅ Clean, structured Markdown
- ✅ Preserved tables in pipe format
- ✅ Section-aware extraction (ITEM 1, 1A, 7, etc.)
- ✅ Page-aware chunks with metadata headers
- ✅ Ready for embeddings and retrieval

`sec2md` doesn't just clean data — **it gives your AI a map**.

### Next Steps

Try it on different document types:
- 10-Q filings
- 8-K press releases
- Proxy statements (DEF 14A)
- Merger exhibits and material contracts

You'll see the same structural preservation across all SEC document types.