In [4]:
from pathlib import Path
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument()

docs = converter.run(sources=[Path("dinosaur-page.html")])
print(docs)

{'documents': [Document(id=88b9fcd7c6326ff98c525387fc45f0c88e01e733c32c59cf2e788d1cd7cc92ff, content: 'Dinosaurs are a diverse group of reptiles of the clade Dinosauria. They first appeared during the Tr...', meta: {'file_path': 'dinosaur-page.html'})]}


In [11]:
from haystack.components.preprocessors import DocumentSplitter

splitter = DocumentSplitter(split_by="sentence", split_length=1, split_overlap=0)

In [12]:
print(splitter.run(docs["documents"]))

{'documents': [Document(id=4e7f85fc82ffe21e4d8ba53be01a5560194d20d34f09b0e987e7b43e8e3edb21, content: 'Dinosaurs are a diverse group of reptiles of the clade Dinosauria.', meta: {'file_path': 'dinosaur-page.html', 'source_id': '88b9fcd7c6326ff98c525387fc45f0c88e01e733c32c59cf2e788d1cd7cc92ff', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0}), Document(id=8ab98b5b9f7714b5bb3e773f81d290a9a9d6f9c4427b4b45ef0a5901c85af38f, content: ' They first appeared during the Triassic period, between 243 and 233.', meta: {'file_path': 'dinosaur-page.html', 'source_id': '88b9fcd7c6326ff98c525387fc45f0c88e01e733c32c59cf2e788d1cd7cc92ff', 'page_number': 1, 'split_id': 1, 'split_idx_start': 67}), Document(id=0dc66d91444509373f99f412788963ce43bc008c9422bc4e0e62510bf0a303f7, content: '23 million years ago (mya), although the exact origin and timing of the evolution of dinosaurs is a ...', meta: {'file_path': 'dinosaur-page.html', 'source_id': '88b9fcd7c6326ff98c525387fc45f0c88e01e733c32c59cf2e788d1cd

In [13]:
splitter = DocumentSplitter(split_by="passage", split_length=1, split_overlap=0)
print(splitter.run(docs["documents"]))

{'documents': [Document(id=60a15918babab9c042228a6e65ca89c52a37aa96c977e60aabb5896b8c067780, content: 'Dinosaurs are a diverse group of reptiles of the clade Dinosauria. They first appeared during the Tr...', meta: {'file_path': 'dinosaur-page.html', 'source_id': '88b9fcd7c6326ff98c525387fc45f0c88e01e733c32c59cf2e788d1cd7cc92ff', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0})]}


Both approaches, split_by="passage" and split_by="sentence", yield poor results. The first outputs a single chunk containing the whole document. The second splits by sentence as configured, which is finer-grained than wanted here, and it also breaks on sentences that contain floating-point numbers. For example, "They first appeared during the Triassic period, between 243 and 233.23 million years ago (mya), although the exact origin and timing of the evolution of dinosaurs is a ..." is split into two chunks, "They first appeared during the Triassic period, between 243 and 233." and "23 million years ago (mya), although the exact origin and timing of the evolution of dinosaurs is a ...", which is incorrect.
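
The floating-point failure is what you would expect from a splitter that treats every period as a sentence boundary. The following is a minimal sketch, not Haystack's actual implementation, of a boundary rule that splits only where the punctuation is followed by whitespace. The names `SENTENCE_BOUNDARY` and `split_sentences` are hypothetical, and the sample text is shortened from the output above.

```python
import re

# Treat ".", "!" or "?" as a boundary only when whitespace follows it,
# so the period inside "233.23" does not end a sentence.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences without breaking decimal numbers."""
    return [s for s in SENTENCE_BOUNDARY.split(text.strip()) if s]


# Sample text shortened from the converter output above (illustrative only).
text = (
    "Dinosaurs are a diverse group of reptiles of the clade Dinosauria. "
    "They first appeared during the Triassic period, between 243 and 233.23 "
    "million years ago (mya)."
)

for sentence in split_sentences(text):
    print(repr(sentence))
```

This rule keeps "233.23" intact, but it would still split after abbreviations such as "e.g. ", so it is only a starting point.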

In [17]:
from pathlib import Path
from haystack.components.converters import HTMLToDocument

converter = HTMLToDocument(extractor_type="LargestContentExtractor")

docs = converter.run(sources=[Path("dinosaur-page.html")])
print(docs)



{'documents': [Document(id=88b9fcd7c6326ff98c525387fc45f0c88e01e733c32c59cf2e788d1cd7cc92ff, content: 'Dinosaurs are a diverse group of reptiles of the clade Dinosauria. They first appeared during the Tr...', meta: {'file_path': 'dinosaur-page.html'})]}


In [18]:
splitter = DocumentSplitter(split_by="passage", split_length=1, split_overlap=0)
print(splitter.run(docs["documents"]))

{'documents': [Document(id=60a15918babab9c042228a6e65ca89c52a37aa96c977e60aabb5896b8c067780, content: 'Dinosaurs are a diverse group of reptiles of the clade Dinosauria. They first appeared during the Tr...', meta: {'file_path': 'dinosaur-page.html', 'source_id': '88b9fcd7c6326ff98c525387fc45f0c88e01e733c32c59cf2e788d1cd7cc92ff', 'page_number': 1, 'split_id': 0, 'split_idx_start': 0})]}


Tried a different extractor_type (LargestContentExtractor), but the result is still unsatisfactory: the passage splitter again returns a single chunk.

Options:
- use LangChain's HTMLSectionSplitter (https://python.langchain.com/v0.2/docs/how_to/HTML_section_aware_splitter/) and convert the resulting LangChain documents to Haystack documents
- create a custom Haystack component that returns a list of Documents, matching the output of DocumentSplitter: https://docs.haystack.deepset.ai/docs/data-classes#document
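
For the second option, the splitting logic could look roughly like the sketch below. It uses only Python's standard-library `html.parser` and starts a new section at every heading tag. The names `SectionParser` and `split_html_by_heading` are hypothetical, the sample HTML is made up for illustration, and the Haystack component wrapper is left out so the snippet has no dependencies.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}
SKIP_TAGS = {"script", "style"}  # their text is not page content


class SectionParser(HTMLParser):
    """Collect page text into sections, one per heading."""

    def __init__(self):
        super().__init__()
        self.sections = []  # list of {"header": str, "content": str}
        self._header = ""
        self._chunks = []
        self._in_heading = False
        self._skip_depth = 0

    def finish_section(self):
        # Collapse runs of whitespace and store the section if it has text.
        content = " ".join(" ".join(self._chunks).split())
        if content:
            self.sections.append(
                {"header": self._header.strip(), "content": content}
            )
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in HEADING_TAGS:
            # A heading closes the previous section and opens a new one.
            self.finish_section()
            self._header = ""
            self._in_heading = True

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self._skip_depth:
            self._skip_depth -= 1
        elif tag in HEADING_TAGS:
            self._in_heading = False

    def handle_data(self, data):
        if self._skip_depth:
            return
        if self._in_heading:
            self._header += data
        else:
            self._chunks.append(data)


def split_html_by_heading(html: str) -> list[dict]:
    """Return one {"header", "content"} dict per heading-delimited section."""
    parser = SectionParser()
    parser.feed(html)
    parser.close()
    parser.finish_section()  # store the trailing section
    return parser.sections


# Made-up sample page, for illustration only.
sample_html = (
    "<html><body><p>Intro text.</p>"
    "<h2>Origin</h2><p>They first appeared 233.23 million years ago.</p>"
    "<h2>Extinction</h2><p>Most died out.</p></body></html>"
)

for section in split_html_by_heading(sample_html):
    print(section)
```

Inside a custom component, each dict could then be turned into a Haystack document with `Document(content=section["content"], meta={"header": section["header"]})`, so the output has the same shape as DocumentSplitter's.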