HTMLHeaderTextSplitter is a "structure-aware" chunker that splits text at the HTML element level and attaches, as metadata, every header that is in scope for a given chunk. It can return chunks element by element or merge consecutive elements that share the same metadata. The goals are to (a) keep semantically related text grouped together and (b) preserve the context encoded in the document's structure. It can also be combined with other text splitters as part of a chunking pipeline.
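
To make "headers in scope for a chunk" concrete, here is a minimal sketch of the idea using only Python's standard `html.parser`. It is not LangChain's implementation; the `HeaderChunker` class and `split_html` helper are hypothetical names introduced for illustration, and the sketch assumes headers are limited to `h1` through `h6`.

```python
from html.parser import HTMLParser


class HeaderChunker(HTMLParser):
    """Toy header-aware chunker (illustrative only, not LangChain's code)."""

    def __init__(self, headers_to_split_on):
        super().__init__()
        # Map tag name -> metadata key, e.g. {"h1": "Header 1"}.
        self.header_keys = dict(headers_to_split_on)
        self.active = {}            # header tag -> header text currently in scope
        self.current_header = None  # header tag being read, if any
        self.header_buf = []        # text fragments of the header being read
        self.buffer = []            # text fragments of the current chunk
        self.chunks = []            # finished (metadata, text) pairs

    def _flush(self):
        # Emit the buffered text, tagged with the headers in scope right now.
        text = " ".join(self.buffer).strip()
        if text:
            metadata = {
                self.header_keys[tag]: value
                for tag, value in sorted(self.active.items())
            }
            self.chunks.append((metadata, text))
        self.buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.header_keys:
            self._flush()
            # A new header ends the scope of headers at the same or a deeper
            # level. Comparing "h1" < "h2" as strings works for h1..h6.
            self.active = {t: v for t, v in self.active.items() if t < tag}
            self.current_header = tag
            self.header_buf = []

    def handle_endtag(self, tag):
        if tag == self.current_header:
            self.active[tag] = " ".join(self.header_buf).strip()
            self.current_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.current_header:
            self.header_buf.append(text)
        else:
            self.buffer.append(text)

    def close(self):
        super().close()
        self._flush()  # emit whatever text follows the last header


def split_html(html, headers_to_split_on):
    parser = HeaderChunker(headers_to_split_on)
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = "<h1>Title</h1><h2>A</h2><p>First.</p><h2>B</h2><p>Second.</p>"
chunks = split_html(sample, [("h1", "Header 1"), ("h2", "Header 2")])
for metadata, text in chunks:
    print(metadata, text)
```

Each chunk carries the `h1` title plus whichever `h2` was most recently opened, which mirrors the shape of the metadata the real splitter returns in the cells below.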

In [4]:
from langchain_text_splitters import HTMLHeaderTextSplitter

html_string = """
<!DOCTYPE html>
<html lang="en">
<body>

    <h1>Welcome to My Webpage</h1>
    
    <h2>About Me</h2>
    <p>Hello! My name is Najeeb, and I'm passionate about technology and design.</p>

    <h2>Hobbies</h2>
    <div>
        <p>I have several hobbies that I enjoy in my free time:</p>
        <ul>
            <li>Coding</li>
            <li>Reading Books</li>
            <li>Traveling</li>
            <li>Photography</li>
        </ul>
    </div>

    <h2>Contact</h2>
    <p>If you'd like to get in touch, feel free to reach out via email!</p>

</body>
</html>
"""

# Each pair is (HTML tag to split on, metadata key under which its text is stored).
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'Welcome to My Webpage', 'Header 2': 'About Me'}, page_content="Hello! My name is Najeeb, and I'm passionate about technology and design."),
 Document(metadata={'Header 1': 'Welcome to My Webpage', 'Header 2': 'Hobbies'}, page_content='I have several hobbies that I enjoy in my free time:  \nCoding Reading Books Traveling Photography'),
 Document(metadata={'Header 1': 'Welcome to My Webpage', 'Header 2': 'Contact'}, page_content="If you'd like to get in touch, feel free to reach out via email!")]
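
The chunks above can serve as input to a second, size-based splitting stage. In LangChain this is typically done by passing the documents to another splitter, such as `RecursiveCharacterTextSplitter`. The sketch below only illustrates the principle with plain Python: `split_by_size` is a hypothetical helper, it works on `(metadata, text)` pairs rather than `Document` objects, and it breaks on whitespace only.

```python
def split_by_size(chunks, max_chars):
    """Split each (metadata, text) pair into pieces of at most max_chars.

    Every piece keeps a copy of its parent's header metadata. A single word
    longer than max_chars is kept whole rather than cut mid-word.
    """
    pieces = []
    for metadata, text in chunks:
        piece = ""
        for word in text.split():
            candidate = f"{piece} {word}".strip()
            if len(candidate) > max_chars and piece:
                # The current piece is full: emit it and start a new one.
                pieces.append((dict(metadata), piece))
                piece = word
            else:
                piece = candidate
        if piece:
            pieces.append((dict(metadata), piece))
    return pieces


header_chunks = [
    ({"Header 2": "Hobbies"}, "Coding Reading Books Traveling Photography"),
]
sized = split_by_size(header_chunks, max_chars=20)
for metadata, text in sized:
    print(metadata, repr(text))
```

Because the metadata is copied to every piece, a small fragment retrieved on its own still records which section it came from.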

In [5]:
url = "https://plato.stanford.edu/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3"),
    ("h4", "Header 4"),
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on)
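# Fetches the page over HTTP, so this cell needs network access;
# the output depends on the live page and may differ from what is shown below.
html_header_splits = html_splitter.split_text_from_url(url)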
html_header_splits

[Document(metadata={}, page_content="Stanford Encyclopedia of Philosophy  \nMenu  \nBrowse About Support SEP  \nTable of Contents What's New Random Entry Chronological Archives  \nEditorial Information About the SEP Editorial Board How to Cite the SEP Special Characters Advanced Tools Contact  \nSupport the SEP PDFs for SEP Friends Make a Donation SEPIA for Libraries  \nSearch Search Tips  \nBrowse"),
 Document(metadata={'Header 2': 'Browse'}, page_content="Table of Contents  \nWhat's New Archives Random Entry"),
 Document(metadata={}, page_content='The Stanford Encyclopedia of Philosophy organizes scholars from around the world in philosophy and related disciplines to create and maintain an up-to-date reference work.  \nCo-Principal Editors: Edward N. Zalta and Uri Nodelman  \nMasthead | Editorial Board  \nCurrent Operations Are Supported By:'),
 Document(metadata={'Header 4': 'Current Operations Are Supported By:'}, page_content='The Offices of the Provost, the Dean of Humanities and