### Test

In [2]:
import os
import sys

# Add the parent directory of 'wiki' to the Python path
sys.path.append(os.path.abspath(os.path.join(os.getcwd(), '..')))

from pathlib import Path
from haystack.components.converters import TextFileToDocument
from haystack import Pipeline
from wiki.lib.index.chunk.wiki_page_chunker import WikiPageChunker



In [3]:
converter = TextFileToDocument()
splitter = WikiPageChunker()

chunk_pipeline = Pipeline()

chunk_pipeline.add_component("converter", converter)
chunk_pipeline.add_component("splitter", splitter)

chunk_pipeline.connect("converter", "splitter")

result = chunk_pipeline.run(data={"converter": {"sources": [Path("data/v2/Dinosaurs/Dinosaur.html")], "meta": {"page_title": "Dinosaur"}}})

In [27]:
result

{'splitter': {'documents': [Document(id=0e1040d8-2918-4e0f-96f4-c9aa77933d72, content: 'Dinosaurs are a diverse group of reptiles of the clade Dinosauria . They first appeared during the T...', meta: {'file_path': 'data/v2/Dinosaurs/Dinosaur.html', 'source_id': '3283d9d1d64425e10055eed8bc2bfb821c10a22b1c4c33964b651d49788b7918', 'split_id': 0, 'title': 'Dinosaurs'}),
   Document(id=5f1e55ac-539d-4494-b2c8-4ab22e881604, content: 'Dinosaurs are varied from taxonomic, morphological and ecological standpoints. Birds, at over 11,000...', meta: {'file_path': 'data/v2/Dinosaurs/Dinosaur.html', 'source_id': '3283d9d1d64425e10055eed8bc2bfb821c10a22b1c4c33964b651d49788b7918', 'split_id': 1, 'title': 'Dinosaurs'}),
   Document(id=9b459eb2-6332-4a7e-9181-9d3d7af65096, content: 'While dinosaurs were ancestrally bipedal, many extinct groups included quadrupedal species, and some...', meta: {'file_path': 'data/v2/Dinosaurs/Dinosaur.html', 'source_id': '3283d9d1d64425e10055eed8bc2bfb821c10a22b1c4c33964

Does the pipeline result have a .to_json() method?

In [26]:
result_dict = result.to_json()

AttributeError: 'dict' object has no attribute 'to_json'
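
So the pipeline result is a plain Python dict, which has no `.to_json()`. One possible workaround, as a rough sketch only: hand `json.dumps` a `default` hook that calls `to_dict()` on anything it cannot serialize by itself. `StandInDoc` and `result_to_json` are hypothetical names made up here so the snippet runs without Haystack; the sketch also assumes `to_dict()` returns only JSON-serializable values.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class StandInDoc:
    """Hypothetical stand-in for a Document; it only mimics to_dict()."""
    id: str
    content: str
    meta: dict = field(default_factory=dict)

    def to_dict(self):
        return asdict(self)


def result_to_json(result):
    """Serialize a nested result dict, converting objects via to_dict()."""
    def encode(obj):
        # json.dumps calls this hook only for objects it cannot serialize natively
        if hasattr(obj, "to_dict"):
            return obj.to_dict()
        raise TypeError(f"Cannot serialize {type(obj).__name__}")

    return json.dumps(result, default=encode, indent=2)


# Same shape as the pipeline result above, but with made-up values
result_example = {
    "splitter": {"documents": [StandInDoc("a1", "Dinosaurs are ...", {"split_id": 0})]}
}
json_text = result_to_json(result_example)
```

With the real pipeline output, the same helper should work as long as every non-dict object in the result exposes `to_dict()`.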

### Experiment: Document to python dict and back

In [20]:
sample_doc = result["splitter"]["documents"][0]
sample_doc

Document(id=0e1040d8-2918-4e0f-96f4-c9aa77933d72, content: 'Dinosaurs are a diverse group of reptiles of the clade Dinosauria . They first appeared during the T...', meta: {'file_path': 'data/v2/Dinosaurs/Dinosaur.html', 'source_id': '3283d9d1d64425e10055eed8bc2bfb821c10a22b1c4c33964b651d49788b7918', 'split_id': 0, 'title': 'Dinosaurs'})

In [21]:
sample_doc_dict = sample_doc.to_dict()
sample_doc_dict

{'id': '0e1040d8-2918-4e0f-96f4-c9aa77933d72',
 'content': 'Dinosaurs are a diverse group of reptiles of the clade Dinosauria . They first appeared during the Triassic period, between 243 and 233.23 million years ago (mya), although the exact origin and timing of the evolution of dinosaurs is a subject of active research. They became the dominant terrestrial vertebrates after the Triassic–Jurassic extinction event 201.3 mya and their dominance continued throughout the Jurassic and Cretaceous periods. The fossil record shows that birds are feathered dinosaurs, having evolved from earlier theropods during the Late Jurassic epoch, and are the only dinosaur lineage known to have survived the Cretaceous–Paleogene extinction event approximately 66 mya. Dinosaurs can therefore be divided into avian dinosaurs —birds—and the extinct non-avian dinosaurs , which are all dinosaurs other than birds.',
 'dataframe': None,
 'blob': None,
 'score': None,
 'embedding': None,
 'sparse_embedding': None,


In [22]:
from haystack import Document
recovered_sample_doc = Document.from_dict(sample_doc_dict)
recovered_sample_doc


Document(id=0e1040d8-2918-4e0f-96f4-c9aa77933d72, content: 'Dinosaurs are a diverse group of reptiles of the clade Dinosauria . They first appeared during the T...', meta: {'file_path': 'data/v2/Dinosaurs/Dinosaur.html', 'source_id': '3283d9d1d64425e10055eed8bc2bfb821c10a22b1c4c33964b651d49788b7918', 'split_id': 0, 'title': 'Dinosaurs'})

In [23]:
# Just for fun
sample_doc == recovered_sample_doc

True

Cool! The equality comparison may work on the Document.id, which would make sense.

***Sanity check for results as a whole:***

In [28]:
documents = result["splitter"]["documents"]
result["splitter"]["documents"] = [doc.to_dict() for doc in documents]

result

{'splitter': {'documents': [{'id': '0e1040d8-2918-4e0f-96f4-c9aa77933d72',
    'content': 'Dinosaurs are a diverse group of reptiles of the clade Dinosauria . They first appeared during the Triassic period, between 243 and 233.23 million years ago (mya), although the exact origin and timing of the evolution of dinosaurs is a subject of active research. They became the dominant terrestrial vertebrates after the Triassic–Jurassic extinction event 201.3 mya and their dominance continued throughout the Jurassic and Cretaceous periods. The fossil record shows that birds are feathered dinosaurs, having evolved from earlier theropods during the Late Jurassic epoch, and are the only dinosaur lineage known to have survived the Cretaceous–Paleogene extinction event approximately 66 mya. Dinosaurs can therefore be divided into avian dinosaurs —birds—and the extinct non-avian dinosaurs , which are all dinosaurs other than birds.',
    'dataframe': None,
    'blob': None,
    'score': None,
    'em

Works! Let's convert the dicts in result back to Document objects.

In [29]:
documents = result["splitter"]["documents"]
result["splitter"]["documents"] = [Document.from_dict(doc) for doc in documents]

result

{'splitter': {'documents': [Document(id=0e1040d8-2918-4e0f-96f4-c9aa77933d72, content: 'Dinosaurs are a diverse group of reptiles of the clade Dinosauria . They first appeared during the T...', meta: {'file_path': 'data/v2/Dinosaurs/Dinosaur.html', 'source_id': '3283d9d1d64425e10055eed8bc2bfb821c10a22b1c4c33964b651d49788b7918', 'split_id': 0, 'title': 'Dinosaurs'}),
   Document(id=5f1e55ac-539d-4494-b2c8-4ab22e881604, content: 'Dinosaurs are varied from taxonomic, morphological and ecological standpoints. Birds, at over 11,000...', meta: {'file_path': 'data/v2/Dinosaurs/Dinosaur.html', 'source_id': '3283d9d1d64425e10055eed8bc2bfb821c10a22b1c4c33964b651d49788b7918', 'split_id': 1, 'title': 'Dinosaurs'}),
   Document(id=9b459eb2-6332-4a7e-9181-9d3d7af65096, content: 'While dinosaurs were ancestrally bipedal, many extinct groups included quadrupedal species, and some...', meta: {'file_path': 'data/v2/Dinosaurs/Dinosaur.html', 'source_id': '3283d9d1d64425e10055eed8bc2bfb821c10a22b1c4c33964

Alright, converting Documents to dicts and back works.
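
The two cells above overwrite `result` in place. A non-mutating version of the same round trip could look like the sketch below. `documents_to_dicts`, `dicts_to_documents` and `StandInDoc` are hypothetical names used only for illustration; the stand-in class mimics `to_dict()` / `from_dict()` so the snippet runs without Haystack, and the real `Document` class would be passed as `doc_cls` instead.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class StandInDoc:
    """Hypothetical stand-in for a Document with to_dict() / from_dict()."""
    id: str
    content: str
    meta: dict = field(default_factory=dict)

    def to_dict(self):
        return asdict(self)

    @classmethod
    def from_dict(cls, data):
        return cls(**data)


def documents_to_dicts(result):
    """Return a copy of result with every 'documents' list converted to dicts."""
    converted = {}
    for name, output in result.items():
        converted[name] = dict(output)  # shallow copy, the input stays untouched
        if "documents" in output:
            converted[name]["documents"] = [doc.to_dict() for doc in output["documents"]]
    return converted


def dicts_to_documents(result, doc_cls):
    """Inverse of documents_to_dicts: rebuild objects via doc_cls.from_dict()."""
    converted = {}
    for name, output in result.items():
        converted[name] = dict(output)
        if "documents" in output:
            converted[name]["documents"] = [doc_cls.from_dict(d) for d in output["documents"]]
    return converted


# Made-up data in the same shape as the pipeline result
original = {
    "splitter": {
        "documents": [
            StandInDoc("a1", "first chunk", {"split_id": 0}),
            StandInDoc("b2", "second chunk", {"split_id": 1}),
        ]
    }
}
as_dicts = documents_to_dicts(original)
restored = dicts_to_documents(as_dicts, StandInDoc)
```

Because both helpers build a new outer dict, `original` still holds objects after the round trip, which makes it easy to compare `restored` against it.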