## HTML text splitter 

`HTMLHeaderTextSplitter` is a structure-aware chunking method that splits an HTML document at its header elements (for example `h1`, `h2`, and `h3`). Text under the same header stays together, text under different headers is separated, and each chunk carries its enclosing headers as metadata.
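To make the idea concrete, here is a minimal sketch of header-based splitting that uses only Python's standard library. It is not LangChain's implementation: the `HeaderSplitter` class, the `split_by_headers` helper, and the `sample` string are hypothetical names invented for illustration, and the sketch ignores many real-world cases (scripts, nested layouts, malformed markup).

```python
from html.parser import HTMLParser


class HeaderSplitter(HTMLParser):
    """Toy header-aware splitter (illustrative only, not LangChain's code)."""

    def __init__(self, header_tags=("h1", "h2", "h3")):
        super().__init__()
        self.header_tags = header_tags
        self.current = {}      # active header text per tag, e.g. {"h1": "Title"}
        self.in_header = None  # header tag that is currently open, if any
        self.chunks = []       # list of (metadata dict, text) pairs

    def handle_starttag(self, tag, attrs):
        if tag in self.header_tags:
            self.in_header = tag

    def handle_endtag(self, tag):
        if tag == self.in_header:
            self.in_header = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self.in_header:
            # A new header replaces its own level and clears every deeper level.
            level = self.header_tags.index(self.in_header)
            for deeper in self.header_tags[level:]:
                self.current.pop(deeper, None)
            self.current[self.in_header] = text
        else:
            # Body text becomes a chunk tagged with the headers above it.
            self.chunks.append((dict(self.current), text))


def split_by_headers(html):
    parser = HeaderSplitter()
    parser.feed(html)
    return parser.chunks


sample = (
    "<h1>Title</h1><p>Intro</p>"
    "<h2>Part A</h2><p>Body A</p>"
    "<h2>Part B</h2><p>Body B</p>"
)
chunks = split_by_headers(sample)
for metadata, text in chunks:
    print(metadata, "->", text)
```

Each body paragraph ends up paired with the headers it sits under, which is the same kind of metadata the LangChain splitter attaches to its `Document` objects.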

In [2]:
from langchain.text_splitter import HTMLHeaderTextSplitter

# Build an HTML string for an AI engineer's portfolio page
html_string = """
<!DOCTYPE html>
<html>
<head>
    <title>AI Engineer Portfolio</title>
</head>
<body>
    <h1>AI Engineer Portfolio</h1>
    <p>Welcome to my portfolio page! Here, you can find a collection of my AI projects and experiences. I have worked on a variety of projects, including natural language processing, computer vision, and machine learning. I have also contributed to open-source projects and have a strong interest in data science and AI.</p>
    <h2>Natural Language Processing</h2>
    <div class="project">
        <h3>Text Summarization</h3>
        <p>I have developed a text summarization model that can generate concise summaries of long documents. This project involved training a model on a dataset of news articles and then using it to generate summaries for new articles. The model achieved an accuracy of 95% on the dataset.</p>
        <p><a href="https://github.com/ai-engineer/text-summarization">View Project on GitHub</a></p>
    </div>
    <div class="project">
        <h3>Sentiment Analysis</h3>
        <p>I have developed a sentiment analysis model that can classify text as positive, negative, or neutral. This project involved training a model on a dataset of customer reviews and then using it to classify new reviews. The model achieved an accuracy of 90% on the dataset.</p>
        <p><a href="https://github.com/ai-engineer/sentiment-analysis">View Project on GitHub</a></p>
    </div>
    <div class="project">
        <h3>Named Entity Recognition</h3>
        <p>I have developed a named entity recognition model that can identify and classify named entities in text. This project involved training a model on a dataset of news articles and then using it to identify named entities in new articles. The model achieved an accuracy of 90% on the dataset.</p>
        <p><a href="https://github.com/ai-engineer/named-entity-recognition">View Project on GitHub</a></p>
    </div>
    <div class="project">
        <h3>Language Translation</h3>    
        <p>I have developed a language translation model that can translate text from one language to another. This project involved training a model on a dataset of translation pairs and then using it to translate new text. The model achieved an accuracy of 90% on the dataset.</p>
        <p><a href="https://github.com/ai-engineer/language-translation">View Project on GitHub</a></p>
    </div>
  
 </body>
 </html>
 """


In [8]:
headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3")
]

html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text(html_string)
html_header_splits

[Document(metadata={'Header 1': 'AI Engineer Portfolio'}, page_content='AI Engineer Portfolio'),
 Document(metadata={'Header 1': 'AI Engineer Portfolio'}, page_content='Welcome to my portfolio page! Here, you can find a collection of my AI projects and experiences. I have worked on a variety of projects, including natural language processing, computer vision, and machine learning. I have also contributed to open-source projects and have a strong interest in data science and AI.'),
 Document(metadata={'Header 1': 'AI Engineer Portfolio', 'Header 2': 'Natural Language Processing'}, page_content='Natural Language Processing'),
 Document(metadata={'Header 1': 'AI Engineer Portfolio', 'Header 2': 'Natural Language Processing', 'Header 3': 'Text Summarization'}, page_content='Text Summarization'),
 Document(metadata={'Header 1': 'AI Engineer Portfolio', 'Header 2': 'Natural Language Processing', 'Header 3': 'Text Summarization'}, page_content='I have developed a text summarization model that
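The metadata in the output above is hierarchical: each chunk lists every header it is nested under. One possible use, sketched below, is to turn that metadata into a breadcrumb string that can be prefixed to a chunk for context. The `breadcrumb` helper and the `meta` dictionary are hypothetical; the keys assume the `"Header 1"`, `"Header 2"`, `"Header 3"` labels chosen in `headers_to_split_on`.

```python
def breadcrumb(metadata, order=("Header 1", "Header 2", "Header 3"), sep=" > "):
    """Join the header values present in metadata, outermost first."""
    return sep.join(metadata[key] for key in order if key in metadata)


# Metadata copied from one of the Document objects shown above.
meta = {
    "Header 1": "AI Engineer Portfolio",
    "Header 2": "Natural Language Processing",
    "Header 3": "Text Summarization",
}
print(breadcrumb(meta))
```

With real splitter output, the same helper could be called as `breadcrumb(doc.metadata)` for each `doc` in `html_header_splits`.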

In [10]:
url = "https://lilianweng.github.io/posts/2023-06-23-agent/"

headers_to_split_on = [
    ("h1", "Header 1"),
    ("h2", "Header 2"),
    ("h3", "Header 3")
]
html_splitter = HTMLHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
html_header_splits = html_splitter.split_text_from_url(url)
html_header_splits

[Document(metadata={}, page_content='if (localStorage.getItem("pref-theme") === "dark") {\n        document.body.classList.add(\'dark\');\n    } else if (localStorage.getItem("pref-theme") === "light") {\n        document.body.classList.remove(\'dark\')\n    } else if (window.matchMedia(\'(prefers-color-scheme: dark)\').matches) {\n        document.body.classList.add(\'dark\');\n    }  \nMathJax = {\n    tex: {\n      inlineMath: [[\'$\', \'$\'], [\'\\\\(\', \'\\\\)\']],\n      displayMath: [[\'$$\',\'$$\'], [\'\\\\[\', \'\\\\]\']],\n      processEscapes: true,\n      processEnvironments: true\n    },\n    options: {\n      skipHtmlTags: [\'script\', \'noscript\', \'style\', \'textarea\', \'pre\']\n    }\n  };\n\n  window.addEventListener(\'load\', (event) => {\n      document.querySelectorAll("mjx-container").forEach(function(x){\n        x.parentElement.classList += \'has-jax\'})\n    });  \nLil\'Log  \n|  \nPosts  \nArchive  \nSearch  \nTags  \nFAQ'),
 Document(metadata={'Header 1':
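The first `Document` in the output above has empty metadata and contains JavaScript and navigation text rather than article content, because it precedes the first matched header on the page. A simple way to discard such chunks might look like the sketch below. The `Doc` dataclass is a hypothetical stand-in for LangChain's `Document` (only the `page_content` and `metadata` attributes are assumed), and `keep_with_headers` is an illustrative helper, not part of the library.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a LangChain Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def keep_with_headers(docs):
    """Keep only chunks that sit under at least one matched header."""
    return [doc for doc in docs if doc.metadata]


# Placeholder data shaped like the splitter output, not real page content.
docs = [
    Doc('if (localStorage.getItem("pref-theme") === "dark") { ... }'),
    Doc("Example section text", {"Header 1": "Example title"}),
]
kept = keep_with_headers(docs)
print(len(kept), kept[0].metadata)
```

This filter is deliberately blunt: it would also drop any legitimate introductory text that appears before the first header, so it is worth inspecting what gets removed before relying on it.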