# Document Transformers

In [None]:
from langchain_text_splitters import CharacterTextSplitter

text = """The world of street food is a vibrant tapestry of flavors and cultures. From sizzling skewers to spicy tacos, every corner of the globe has its own unique offerings. Vendors often set up shop in bustling markets or on busy street corners, attracting hungry passersby with mouthwatering aromas.

One of the best things about street food is its accessibility. It’s quick, affordable, and often made fresh right in front of you. Whether it’s a steaming bowl of pho in Vietnam or a crispy samosa in India, there’s something for everyone to enjoy.

Street food also brings people together. Friends and families gather around food stalls, sharing dishes and stories. It’s a social experience that transcends language and culture, creating connections over a shared love of good eats.

Finally, street food is constantly evolving. Chefs are experimenting with traditional recipes, adding modern twists and fusion flavors. This creativity keeps the scene exciting and ensures there’s always something new to try."""



## Length-Based Splitter

- Token-based: Splits text based on the number of tokens, which is useful when working with language models.
- Character-based: Splits text based on the number of characters, which can be more consistent across different types of text.
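
To make the two options concrete, here is a minimal sketch of fixed-window splitting with overlap. It is not LangChain's implementation: `split_by_length` is a hypothetical helper, and whitespace-separated words stand in for model tokens (a real tokenizer such as tiktoken would produce different units).

```python
def split_by_length(units, chunk_size, chunk_overlap):
    """Slide a window of chunk_size units, stepping by chunk_size - chunk_overlap.

    `units` can be a string (character-based) or a list of tokens (token-based).
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + chunk_size])
        if start + chunk_size >= len(units):
            break  # the window already reached the end of the input
    return chunks


sample = "Street food is quick, affordable, and often made fresh."

# Character-based: every chunk holds at most 20 characters.
char_chunks = split_by_length(sample, chunk_size=20, chunk_overlap=5)

# Token-based (assumption: words approximate tokens for illustration only).
word_chunks = [
    " ".join(words)
    for words in split_by_length(sample.split(), chunk_size=4, chunk_overlap=1)
]

print(char_chunks)
print(word_chunks)
```

Each chunk repeats the last `chunk_overlap` units of the previous one, which is what lets context carry across chunk boundaries.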




In [None]:

text_splitter = CharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base", chunk_size=100, chunk_overlap=20
)


docs = text_splitter.create_documents([text])

print(type(docs))

print(len(docs))

print(docs[0])


## Text-Structure-Based Splitter

- The RecursiveCharacterTextSplitter attempts to keep larger units (e.g., paragraphs) intact.
- If a unit exceeds the chunk size, it moves to the next level (e.g., sentences).
- This process continues down to the word level if necessary.
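
The fallback from paragraphs to sentences to words can be sketched in plain Python. This is a simplified illustration, not the library's code: `recursive_split` and its separator list are assumptions, separators are dropped at split points, and overlap is omitted.

```python
def recursive_split(text, chunk_size, separators=("\n\n", ". ", " ")):
    """Split on the coarsest separator first; descend to finer separators
    only for pieces that are still longer than chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing finer left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = current + sep + piece if current else piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits: keep growing this chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: move to the next level down.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = (
    "Street food is quick.\n\n"
    "It brings people together. Friends and families gather around food stalls."
)
chunks = recursive_split(sample, chunk_size=40)
for chunk in chunks:
    print(len(chunk), repr(chunk))
```

The first paragraph fits and stays whole, the second is broken into sentences, and only the sentence that is still too long is split at word boundaries.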

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=10)

docs = text_splitter.create_documents([text])

print(type(docs))

print(len(docs))

print(docs[0])

## Document-Structure-Based Splitter
- Preserves the logical organization of the document
- Maintains context within each chunk
- Can be more effective for downstream tasks like retrieval or summarization

**Examples of structure-based splitting:**
- Markdown: Split based on headers (e.g., #, ##, ###)
- HTML: Split using tags
- JSON: Split by object or array elements
- Code: Split by functions, classes, or logical blocks
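
As a rough illustration of the Markdown case, the sketch below groups lines under the most recent headers and records the header trail as metadata. `split_markdown_by_headers` is a hypothetical helper that uses only the standard library; it ignores edge cases such as `#` lines inside fenced code blocks.

```python
import re

# Matches "#", "##", or "###" followed by whitespace and a title.
HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_markdown_by_headers(md_text):
    """Return a list of {"metadata": {...}, "content": str} sections."""
    sections = []
    trail = {}  # header level -> title for the section being read
    body = []

    def flush():
        content = "\n".join(body).strip()
        if content:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(trail.items())}
            sections.append({"metadata": metadata, "content": content})
        body.clear()

    for line in md_text.splitlines():
        match = HEADER_RE.match(line)
        if match is None:
            body.append(line)
            continue
        flush()
        level = len(match.group(1))
        # A new header closes every header at the same or a deeper level.
        for lvl in [l for l in trail if l >= level]:
            del trail[lvl]
        trail[level] = match.group(2).strip()
    flush()
    return sections


md_sample = "# Guide\nIntro text.\n## Setup\nInstall it.\n## Usage\nRun it.\n### Flags\nUse -v."
sections = split_markdown_by_headers(md_sample)
for section in sections:
    print(section)
```

Because every chunk carries its full header trail, a retriever can later filter or rank chunks by the part of the document they came from.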

### Markdown Splitter


In [None]:
from langchain_text_splitters import MarkdownHeaderTextSplitter

# read markdown file
with open("data/sample.md", "r") as f:
    md_text = f.read()
    
md_text

In [None]:

headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
]

markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)

docs = markdown_splitter.split_text(md_text)

print(type(docs))

print(len(docs))

docs



In [None]:
# Character-level splits within each header-based section
from langchain_text_splitters import RecursiveCharacterTextSplitter

chunk_size = 10
chunk_overlap = 2
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=chunk_size, chunk_overlap=chunk_overlap
)

# Split
splits = text_splitter.split_documents(docs)
splits

### HTML Splitter
#### Header Splitter

In [None]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
    <!DOCTYPE html>
    <html>
    <body>
        <div>
            <h1>Foo</h1>
            <p>Some intro text about Foo.</p>
            <div>
                <h2>Bar main section</h2>
                <p>Some intro text about Bar.</p>
                <h3>Bar subsection 1</h3>
                <p>Some text about the first subtopic of Bar.</p>
                <h3>Bar subsection 2</h3>
                <p>Some text about the second subtopic of Bar.</p>
            </div>
            <div>
                <h2>Baz</h2>
                <p>Some text about Baz</p>
            </div>
            <br>
            <p>Some concluding text about Foo</p>
        </div>
    </body>
    </html>
"""


headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

#### Section Splitter

In [None]:
from langchain_text_splitters import HTMLSectionSplitter



headers_to_split_on = [("h1", "Header 1"), ("h2", "Header 2")]

html_splitter = HTMLSectionSplitter(headers_to_split_on)
html_section_splits = html_splitter.split_text(html_string)
html_section_splits

## Semantic-Meaning-Based Splitter

- Start with the first few sentences and generate an embedding.
- Move to the next group of sentences and generate another embedding (e.g., using a sliding window approach).
- Compare the embeddings to find significant differences, which indicate potential "break points" between semantic sections.
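
The idea can be sketched without an embedding model. In the example below, `toy_embed` is a stand-in that counts words (a real model returns dense vectors), the window is a single sentence, and the fixed `threshold` is an assumption; none of this reflects how `SemanticChunker` is implemented internally.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks(text, threshold=0.5):
    """Start a new chunk wherever adjacent sentences differ by more than threshold."""
    sentences = re.split(r"(?<=[.?!])\s+", text.strip())
    chunks, current = [], [sentences[0]]
    for prev, nxt in zip(sentences, sentences[1:]):
        if cosine_distance(toy_embed(prev), toy_embed(nxt)) > threshold:
            chunks.append(" ".join(current))  # break point found
            current = []
        current.append(nxt)
    chunks.append(" ".join(current))
    return chunks


sample = (
    "Tacos are spicy. Spicy tacos are cheap. "
    "Python is a language. The python language is popular."
)
semantic_result = semantic_chunks(sample)
for chunk in semantic_result:
    print(chunk)
```

The two taco sentences share most of their words and stay together, while the jump to the Python sentences shares none, so the distance reaches 1.0 and a break is placed there.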


In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_ollama import OllamaEmbeddings

embedding_model = OllamaEmbeddings(model="mxbai-embed-large")

# SemanticChunker constructor parameters, for reference:
#   embeddings: Embeddings,
#   buffer_size: int = 1,
#   add_start_index: bool = False,
#   breakpoint_threshold_type: BreakpointThresholdType = "percentile",
#   breakpoint_threshold_amount: Optional[float] = None,
#   number_of_chunks: Optional[int] = None,
#   sentence_split_regex: str = r"(?<=[.?!])\s+",
#   min_chunk_size: Optional[int] = None,
text_splitter = SemanticChunker(embedding_model, number_of_chunks=3)

docs = text_splitter.create_documents([text])

print(type(docs))

print(len(docs))


docs

