## Benefits of preserving text structure:

- Better chunking that respects semantic boundaries
- More accurate search and retrieval
- Improved question answering by maintaining context
- Support for structured data such as tables
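
To make the first point concrete, here is a minimal sketch of what "chunking that respects semantic boundaries" can look like for markdown. It is not how any particular library does it: `split_by_headings` is a hypothetical helper that starts a new chunk at every ATX heading (`#` to `######`) and records the ancestor headings of each chunk. It assumes headings are not hidden inside code fences, and it ignores any text before the first heading.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(markdown: str) -> list[dict]:
    """Split markdown into one chunk per heading, recording ancestor headings."""
    chunks = []
    stack = []      # (level, title) pairs for the headings currently open
    current = None  # chunk being filled; None until the first heading is seen
    for line in markdown.strip().splitlines():
        match = HEADING.match(line)
        if match:
            level, title = len(match.group(1)), match.group(2).strip()
            # A new heading closes every open heading at the same or deeper level
            while stack and stack[-1][0] >= level:
                stack.pop()
            current = {"path": [t for _, t in stack], "title": title, "lines": [line]}
            chunks.append(current)
            stack.append((level, title))
        elif current is not None:
            current["lines"].append(line)
    for chunk in chunks:
        chunk["text"] = "\n".join(chunk.pop("lines")).strip()
    return chunks


sample = """
# Guide

## Setup
Install the package.

## Usage
Call the function.

### Options
- verbose
"""

chunks = split_by_headings(sample)
for chunk in chunks:
    print(chunk["path"], "->", chunk["title"])
```

Each chunk keeps its heading line together with the body under it, so a retrieved chunk never starts mid-section.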


In [1]:
# pip install llama-index-core
from llama_index.core import Document
from llama_index.core.node_parser import MarkdownNodeParser
import textwrap

# Sample markdown document with clear structure
markdown_text = """
# AI Engineering Fundamentals

## Introduction to Vector Databases

Vector databases are specialized database systems designed to store and query vector embeddings efficiently.

### Key Advantages
- Efficient similarity search
- Scalable to billions of vectors
- Support for metadata filtering

### Common Operations
1. **Vector Indexing**: Creating data structures for efficient search
2. **Approximate Nearest Neighbor Search**: Finding similar vectors quickly
3. **Hybrid Search**: Combining vector similarity with metadata filters

## Working with Embeddings

Embeddings are dense numerical representations of data that capture semantic meaning.

### Popular Embedding Models
- OpenAI text-embedding-ada-002
- Sentence Transformers
- CLIP for image embeddings
"""

# Create a document
document = Document(text=markdown_text)

# Create a parser that recognizes markdown structure
markdown_parser = MarkdownNodeParser()

# Parse the document
nodes = markdown_parser.get_nodes_from_documents([document])

# Display the resulting nodes
print(f"Total nodes created: {len(nodes)}")
for i, node in enumerate(nodes):
    print(f"\nNode {i+1}:")
    print(f"Text: {textwrap.shorten(node.text, width=60)}...")
    print(f"Metadata: {node.metadata}")

Total nodes created: 6

Node 1:
Text: # AI Engineering Fundamentals...
Metadata: {'header_path': '/'}

Node 2:
Text: ## Introduction to Vector Databases Vector databases [...]...
Metadata: {'header_path': '/AI Engineering Fundamentals/'}

Node 3:
Text: ### Key Advantages - Efficient similarity search - [...]...
Metadata: {'header_path': '/AI Engineering Fundamentals/Introduction to Vector Databases/'}

Node 4:
Text: ### Common Operations 1. **Vector Indexing**: Creating [...]...
Metadata: {'header_path': '/AI Engineering Fundamentals/Introduction to Vector Databases/'}

Node 5:
Text: ## Working with Embeddings Embeddings are dense [...]...
Metadata: {'header_path': '/AI Engineering Fundamentals/'}

Node 6:
Text: ### Popular Embedding Models - OpenAI text-embedding- [...]...
Metadata: {'header_path': '/AI Engineering Fundamentals/Working with Embeddings/'}
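
The `header_path` values in the output above can be used to give each chunk its surrounding context. The following is a minimal sketch, assuming the path format shown above (slash-separated headings with a leading and trailing `/`) and assuming no heading contains a `/` itself. `breadcrumb` and `with_context` are hypothetical helpers, and the records are plain dictionaries standing in for nodes, not library objects.

```python
def breadcrumb(header_path: str, separator: str = " > ") -> str:
    """Turn '/A/B/' into 'A > B'; the root path '/' becomes an empty string."""
    parts = [part for part in header_path.split("/") if part]
    return separator.join(parts)


def with_context(text: str, header_path: str) -> str:
    """Prefix a chunk with its ancestor headings so it reads on its own."""
    crumb = breadcrumb(header_path)
    return f"[{crumb}]\n{text}" if crumb else text


# Stand-ins for two nodes from the output above (plain dicts, not library objects)
records = [
    {"text": "# AI Engineering Fundamentals", "header_path": "/"},
    {
        "text": "### Key Advantages\n- Efficient similarity search",
        "header_path": "/AI Engineering Fundamentals/Introduction to Vector Databases/",
    },
]

for record in records:
    print(with_context(record["text"], record["header_path"]))
    print("---")
```

Prefixing the breadcrumb before embedding or display is one option; the same path could also be kept as metadata and used only for filtering.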


In [2]:
# pip install llama-index-core beautifulsoup4
from llama_index.core.node_parser import HTMLNodeParser
from llama_index.core import Document
import textwrap
from bs4 import BeautifulSoup

# Sample HTML document
html_text = """
<html>
<body>
  <h1>AI Engineering Fundamentals</h1>
  
  <h2>Introduction to Vector Databases</h2>
  <p>Vector databases are specialized database systems designed to store and query vector embeddings efficiently.</p>
  
  <h3>Key Advantages</h3>
  <ul>
    <li>Efficient similarity search</li>
    <li>Scalable to billions of vectors</li>
    <li>Support for metadata filtering</li>
  </ul>
  
  <h3>Common Operations</h3>
  <ol>
    <li><b>Vector Indexing</b>: Creating data structures for efficient search</li>
    <li><b>Approximate Nearest Neighbor Search</b>: Finding similar vectors quickly</li>
    <li><b>Hybrid Search</b>: Combining vector similarity with metadata filters</li>
  </ol>
  
  <h2>Working with Embeddings</h2>
  <p>Embeddings are dense numerical representations of data that capture semantic meaning.</p>
  
  <table border="1">
    <tr>
      <th>Model Name</th>
      <th>Dimensions</th>
      <th>Use Case</th>
    </tr>
    <tr>
      <td>text-embedding-ada-002</td>
      <td>1536</td>
      <td>General text embeddings</td>
    </tr>
    <tr>
      <td>all-MiniLM-L6-v2</td>
      <td>384</td>
      <td>Efficient semantic search</td>
    </tr>
  </table>
</body>
</html>
"""

# Create a document
html_document = Document(text=html_text)

# Create HTML parser
html_parser = HTMLNodeParser()

# Parse the document
html_nodes = html_parser.get_nodes_from_documents([html_document])

# Display the resulting nodes
print(f"Total HTML nodes created: {len(html_nodes)}")
for i, node in enumerate(html_nodes):
    print(f"\nNode {i+1}:")
    print(f"Text: {textwrap.shorten(node.text, width=60)}...")
    print(f"Metadata: {node.metadata}")

# Extract table data specifically
def extract_tables(html_content):
    """Return each <table> as a list of rows, where a row is a list of cell strings."""
    soup = BeautifulSoup(html_content, 'html.parser')
    tables = soup.find_all('table')

    extracted_tables = []
    for table in tables:
        rows = table.find_all('tr')
        table_data = []

        for row in rows:
            cols = row.find_all(['td', 'th'])
            row_data = [col.text.strip() for col in cols]
            table_data.append(row_data)

        extracted_tables.append(table_data)

    return extracted_tables


# Extract tables
tables = extract_tables(html_text)
print("\nExtracted Table:")
for row in tables[0]:
    print(row)

Total HTML nodes created: 14

Node 1:
Text: AI Engineering Fundamentals...
Metadata: {'tag': 'h1'}

Node 2:
Text: Introduction to Vector Databases...
Metadata: {'tag': 'h2'}

Node 3:
Text: Vector databases are specialized database systems [...]...
Metadata: {'tag': 'p'}

Node 4:
Text: Key Advantages...
Metadata: {'tag': 'h3'}

Node 5:
Text: Efficient similarity search Scalable to billions of [...]...
Metadata: {'tag': 'li'}

Node 6:
Text: Common Operations...
Metadata: {'tag': 'h3'}

Node 7:
Text: : Creating data structures for efficient search...
Metadata: {'tag': 'li'}

Node 8:
Text: Vector Indexing...
Metadata: {'tag': 'b'}

Node 9:
Text: : Finding similar vectors quickly...
Metadata: {'tag': 'li'}

Node 10:
Text: Approximate Nearest Neighbor Search...
Metadata: {'tag': 'b'}

Node 11:
Text: : Combining vector similarity with metadata filters...
Metadata: {'tag': 'li'}

Node 12:
Text: Hybrid Search...
Metadata: {'tag': 'b'}

Node 13:
Text: Working with Embeddings...
Metadata: {'tag':
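
In the output above, the text of each `<li>` in the ordered list and the text of its nested `<b>` land in separate nodes, so an item such as "Vector Indexing: Creating data structures for efficient search" is split in two. If whole list items are needed, one option is to collect them directly. The sketch below uses only the standard library; `ListItemCollector` is a hypothetical helper, and it assumes list items are not nested inside one another.

```python
from html.parser import HTMLParser


class ListItemCollector(HTMLParser):
    """Collect the full text of each <li>, including nested inline tags."""

    def __init__(self):
        super().__init__()
        self.items = []
        self._buffer = None  # text fragments while inside an <li>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "li" and self._buffer is not None:
            # Join the fragments, then collapse runs of whitespace
            self.items.append(" ".join("".join(self._buffer).split()))
            self._buffer = None


snippet = """
<ol>
  <li><b>Vector Indexing</b>: Creating data structures for efficient search</li>
  <li><b>Hybrid Search</b>: Combining vector similarity with metadata filters</li>
</ol>
"""

collector = ListItemCollector()
collector.feed(snippet)
for item in collector.items:
    print(item)
```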

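`extract_tables` returns each table as a list of rows, with the header row first. For retrieval it is often more convenient to key every cell by its column name. The following is a minimal sketch under that assumption (first row is the header); `rows_to_records` and `record_to_sentence` are hypothetical helpers, and the sample rows are copied from the HTML table in the cell above.

```python
def rows_to_records(table_rows: list[list[str]]) -> list[dict[str, str]]:
    """Treat the first row as the header and map each later row onto it."""
    if not table_rows:
        return []
    header, *body = table_rows
    # zip stops at the shorter sequence, so extra cells in a row are dropped
    return [dict(zip(header, row)) for row in body]


def record_to_sentence(record: dict[str, str]) -> str:
    """Flatten one record into a 'key: value' line that can be indexed as text."""
    return "; ".join(f"{key}: {value}" for key, value in record.items())


# Rows in the list-of-lists shape that extract_tables returns
sample_table = [
    ["Model Name", "Dimensions", "Use Case"],
    ["text-embedding-ada-002", "1536", "General text embeddings"],
    ["all-MiniLM-L6-v2", "384", "Efficient semantic search"],
]

records = rows_to_records(sample_table)
for record in records:
    print(record_to_sentence(record))
```

Keeping the column name next to every value means a single row still makes sense when it is retrieved without the rest of the table.
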
In [3]:
# Function to create a searchable document map
def create_document_map(nodes):
    """Create a searchable map of document sections"""
    document_map = {}

    for i, node in enumerate(nodes):
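        # Read heading info from metadata, falling back to defaults.
        # The markdown nodes above carry only 'header_path', so for them
        # both lookups fall back: "Section N" at level 0.
        heading = node.metadata.get("heading", f"Section {i+1}")
        level = node.metadata.get("heading_level", 0)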

        # Add to document map with indent based on level
        indent = "  " * (level - 1) if level > 0 else ""
        document_map[heading] = {
            "index": i,
            "level": level,
            "text": node.text,
            "display": f"{indent}{heading}"
        }

    return document_map


# Create document map from our markdown nodes
doc_map = create_document_map(nodes)

# Display the document structure as a table of contents
print("Document Table of Contents:")
for info in doc_map.values():
    print(info["display"])

# Simple section lookup function
def find_section(query, doc_map):
    """Find sections that match a query string"""
    matches = []

    for heading, info in doc_map.items():
        # Check if query is in heading or content
        if query.lower() in heading.lower() or query.lower() in info['text'].lower():
            matches.append((heading, info))

    return matches


# Try looking up sections
search_terms = ["advantages", "embedding models", "indexing"]

for term in search_terms:
    print(f"\nSearching for '{term}':")
    results = find_section(term, doc_map)

    if results:
        for heading, info in results:
            print(f"Found in: {info['display']}")
            # Extract a relevant snippet
            text = info['text']
            start = max(0, text.lower().find(term.lower()) - 40)
            snippet = text[start:start+100] + "..."
            print(f"Snippet: {snippet}")
    else:
        print("No results found")

Document Table of Contents:
Section 1
Section 2
Section 3
Section 4
Section 5
Section 6

Searching for 'advantages':
Found in: Section 3
Snippet: ### Key Advantages
- Efficient similarity search
- Scalable to billions of vectors
- Support for met...

Searching for 'embedding models':
Found in: Section 6
Snippet: ### Popular Embedding Models
- OpenAI text-embedding-ada-002
- Sentence Transformers
- CLIP for imag...

Searching for 'indexing':
Found in: Section 4
Snippet: ### Common Operations
1. **Vector Indexing**: Creating data structures for efficient search
2. **App...
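
The table of contents above lists `Section 1` through `Section 6` because the markdown nodes in this run carry only a `header_path` key, so the `heading` and `heading_level` lookups fall back to their defaults. One possible workaround is to read the heading and its level from the first line of each node's text. This is a sketch, assuming each node's text begins with its markdown heading, as it does in the markdown output earlier. `heading_info` is a hypothetical helper, and the sample strings are stand-ins for `node.text`.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def heading_info(text: str, fallback: str) -> tuple[str, int]:
    """Return (heading, level) from the first line of a chunk, or (fallback, 0)."""
    lines = text.strip().splitlines()
    first_line = lines[0] if lines else ""
    match = HEADING.match(first_line)
    if match:
        return match.group(2).strip(), len(match.group(1))
    return fallback, 0


# Stand-ins for the text of three markdown nodes (plain strings, not library objects)
texts = [
    "# AI Engineering Fundamentals",
    "## Introduction to Vector Databases\n\nVector databases are ...",
    "### Key Advantages\n- Efficient similarity search",
]

for i, text in enumerate(texts):
    heading, level = heading_info(text, fallback=f"Section {i+1}")
    indent = "  " * (level - 1) if level > 0 else ""
    print(f"{indent}{heading}")
```

Inside `create_document_map`, the two `node.metadata.get(...)` lines could be replaced by a call to `heading_info(node.text, f"Section {i+1}")` to get an indented table of contents.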
