In this notebook, I demo how to use `LlamaIndex` to build a RAG pipeline from a given document. Along the way, I explore the package out of my own curiosity.

Prerequisite:
- Install the package: `pip install llama-index`

In [None]:
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import FlatReader, MarkdownReader
from llama_index.core.node_parser import MarkdownNodeParser
from pathlib import Path

from mistune import markdown

## load the document
- case 1: load by `SimpleDirectoryReader`
- case 2: load by `MarkdownReader`
- case 3: load by `FlatReader`

In [None]:
import os
print(os.getcwd())

In [26]:
simple_reader_docs = SimpleDirectoryReader(
    input_files = ['data/summary.md']
).load_data()

markdown_reader_docs = MarkdownReader().load_data(Path('data/summary.md'))

flat_reader_docs = FlatReader().load_data(Path('data/summary.md'))

print('the length of simple_reader_docs:', len(simple_reader_docs))
print('the length of flat_reader_docs:', len(flat_reader_docs))
print('the length of markdown_reader_docs:', len(markdown_reader_docs))


print('the first doc of simple_reader_docs:', simple_reader_docs[0])
print('the first doc of flat_reader_docs:', flat_reader_docs[0])
print('the first doc of markdown_reader_docs:', markdown_reader_docs[0])

the length of simple_reader_docs: 7
the length of flat_reader_docs: 1
the length of markdown_reader_docs: 7
the first doc of simple_reader_docs: Doc ID: 4809984c-c771-4415-9200-33c6f0a58846
Text: Introducing Our Comprehensive Forecasting Service    Welcome to
our state-of-the-art forecasting service, designed to transform your
historical data into reliable future predictions. Our service is
tailored to handle various data types, including daily, intraday, and
special datasets, ensuring that your forecasting needs are met with
precision a...
the first doc of flat_reader_docs: Doc ID: 90cf5042-b42c-4b1b-8573-020ac28c68d5
Text: Introducing Our Comprehensive Forecasting Service  Welcome to
our state-of-the-art forecasting service, designed to transform your
historical data into reliable future predictions. Our service is
tailored to handle various data types, including daily, intraday, and
special datasets, ensuring that your forecasting needs are met with
precision and...
the first doc of

`SimpleDirectoryReader` and `MarkdownReader` split the document at its Markdown headers, giving 7 documents each. `FlatReader` does not split the document; it returns the whole file as a single document.
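To compare what each reader returns without printing whole documents, a small helper can report the document count and the first line of each document. This is only a sketch: `summarize_docs` is a hypothetical helper of mine, not part of `LlamaIndex`, and it assumes each document exposes its content through a `.text` attribute. The stand-in objects below let it run without the data file.

```python
from types import SimpleNamespace
from typing import Iterable, List, Tuple


def summarize_docs(docs: Iterable, width: int = 40) -> Tuple[int, List[str]]:
    """Return the number of documents and a short first-line preview of each.

    Assumes every item has a `.text` attribute holding its content.
    """
    previews = []
    for doc in docs:
        text = doc.text.strip()
        # Use the first non-empty line as the preview, truncated to `width`
        first_line = text.splitlines()[0] if text else ""
        previews.append(first_line[:width])
    return len(previews), previews


# Stand-in documents (hypothetical content) instead of the real reader output
fake_docs = [
    SimpleNamespace(text="Intro\n\nWelcome."),
    SimpleNamespace(text="Key Features\n- fast"),
]
count, previews = summarize_docs(fake_docs)
print(count, previews)
```

In the notebook, the same call could be applied to `simple_reader_docs`, `markdown_reader_docs`, and `flat_reader_docs` to see where each reader places its document boundaries.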

# Parser

In [36]:
md_nodes = MarkdownNodeParser().get_nodes_from_documents(simple_reader_docs)
print('the length of md_nodes:', len(md_nodes))
md_nodes[2]

the length of md_nodes: 7


TextNode(id_='4a04e4f5-cddd-4afe-a003-054695e8132a', embedding=None, metadata={'file_path': 'data\\summary.md', 'file_name': 'summary.md', 'file_size': 3007, 'creation_date': '2024-09-20', 'last_modified_date': '2024-09-19'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='e6849d57-0165-4559-bc5d-7ffe9fb5e1d6', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'file_path': 'data\\summary.md', 'file_name': 'summary.md', 'file_size': 3007, 'creation_date': '2024-09-20', 'last_modified_date': '2024-09-19'}, hash='07201353024e9e46a046c365993322061e5e1dc6695cdfbaa9ed7dd0c6324138')}, text='Required Input Data for Forecasting\nOur service requires historical data with dates and values. Here are some examples of the 

In [37]:
md_nodes = MarkdownNodeParser().get_nodes_from_documents(flat_reader_docs)
print('the length of md_nodes:', len(md_nodes))
md_nodes[2]

the length of md_nodes: 7


TextNode(id_='d03b0ab9-3459-4196-9fed-2db452ea4bfc', embedding=None, metadata={'filename': 'summary.md', 'extension': '.md', 'Header_3': 'Key Features', 'Header_4': 'Required Input Data for Forecasting'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='90cf5042-b42c-4b1b-8573-020ac28c68d5', node_type=<ObjectType.DOCUMENT: '4'>, metadata={'filename': 'summary.md', 'extension': '.md'}, hash='1f5c3c74c7685dc67c55633c62961c193d29c51bb3fc6eee5b421fde929ad2f1'), <NodeRelationship.PREVIOUS: '2'>: RelatedNodeInfo(node_id='a3c40651-3a67-4f3f-ab07-32ae107c694c', node_type=<ObjectType.TEXT: '1'>, metadata={'filename': 'summary.md', 'extension': '.md', 'Header_3': 'Key Features'}, hash='e4bd94c390dfa75acc83eab7aa86b4b2ba0d09b2fb01726692f17400f17ab0e5'), <NodeRelationship.NEXT: '3'>: RelatedNodeInfo(node_id='e350f2da-8a57-4749-89ba-1a1705b900f3', node_type=<ObjectType.TEXT: '1'>, metadata={'Header_3': 'Key Featu

In [29]:
md_nodes = MarkdownNodeParser().get_nodes_from_documents(markdown_reader_docs)
print('the length of md_nodes:', len(md_nodes))
md_nodes[0]

the length of md_nodes: 7


TextNode(id_='5a9cc8ab-d443-471c-93b7-1719f9183f5c', embedding=None, metadata={}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={<NodeRelationship.SOURCE: '1'>: RelatedNodeInfo(node_id='47836f64-a8df-4741-9526-eedb61d97ffe', node_type=<ObjectType.DOCUMENT: '4'>, metadata={}, hash='d1748259d999b3209eea05a842f23f5e9e7fda530fa3a26aed5bf29518baa7ec')}, text='Introducing Our Comprehensive Forecasting Service\r\n\r\nWelcome to our state-of-the-art forecasting service, designed to transform your historical data into reliable future predictions. Our service is tailored to handle various data types, including daily, intraday, and special datasets, ensuring that your forecasting needs are met with precision and flexibility.', mimetype='text/plain', start_char_idx=0, end_char_idx=362, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_seperator='\n')

If documents are loaded with `FlatReader` and parsed by `MarkdownNodeParser`, the Markdown headers are stored in the node metadata. The reason becomes clear from the source code:
```python
def get_nodes_from_node(self, node: BaseNode) -> List[TextNode]:
    """Get nodes from document."""
    text = node.get_content(metadata_mode=MetadataMode.NONE)
    markdown_nodes = []
    lines = text.split("\n")
    metadata: Dict[str, str] = {}
    code_block = False
    current_section = ""

    for line in lines:
        if line.lstrip().startswith("```"):
            code_block = not code_block
        header_match = re.match(r"^(#+)\s(.*)", line)
        if header_match and not code_block:
            if current_section != "":
                markdown_nodes.append(
                    self._build_node_from_split(
                        current_section.strip(), node, metadata
                    )
                )
            metadata = self._update_metadata(
                metadata, header_match.group(2), len(header_match.group(1).strip())
            )
            current_section = f"{header_match.group(2)}\n"
        else:
            current_section += line + "\n"

    markdown_nodes.append(
        self._build_node_from_split(current_section.strip(), node, metadata)
    )

    return markdown_nodes
```
This shows that when the reader has not already split the document, the parser splits it at each Markdown header, saves the header in the metadata, and maintains the structure of the document in the content.
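To see the splitting logic in isolation, here is a simplified standalone sketch of the same idea. It is not the library's implementation: `split_markdown_by_header` is a hypothetical function, and the way it updates the header metadata (keep shallower headers, drop same-level and deeper ones) is my assumption, chosen to match the `Header_3` / `Header_4` keys seen in the output above.

```python
import re
from typing import Dict, List, Tuple

FENCE = "`" * 3  # code-fence marker, built this way to avoid a literal fence


def split_markdown_by_header(text: str) -> List[Tuple[str, Dict[str, str]]]:
    """Split Markdown at headers; return (section_text, header_metadata) pairs."""
    sections: List[Tuple[str, Dict[str, str]]] = []
    metadata: Dict[str, str] = {}
    in_code_block = False
    current = ""

    for line in text.split("\n"):
        # Toggle on fence lines so '#' comments inside code are not headers
        if line.lstrip().startswith(FENCE):
            in_code_block = not in_code_block
        match = re.match(r"^(#+)\s(.*)", line)
        if match and not in_code_block:
            if current.strip():
                sections.append((current.strip(), dict(metadata)))
            level = len(match.group(1))
            # Assumed rule: keep only headers shallower than the new one
            metadata = {
                key: value
                for key, value in metadata.items()
                if int(key.split("_")[1]) < level
            }
            metadata[f"Header_{level}"] = match.group(2)
            current = match.group(2) + "\n"
        else:
            current += line + "\n"

    if current.strip():
        sections.append((current.strip(), dict(metadata)))
    return sections


# Hypothetical sample text, not the real summary.md
sample = (
    "### Key Features\nFast.\n"
    "#### Required Input Data\nDates and values.\n"
    "### Pricing\nFree."
)
sections = split_markdown_by_header(sample)
for section_text, section_meta in sections:
    print(section_meta, "|", section_text.splitlines()[0])
```

Each section carries the headers above it, so a level-4 section keeps its level-3 parent in the metadata, while a new level-3 header resets the path.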
My Markdown file is well structured, so I will load it with `FlatReader` and parse it with `MarkdownNodeParser` to preserve the structure of the document.
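One way to make use of the preserved structure is to turn the header metadata of a node into a readable breadcrumb. The sketch below is a hypothetical helper (`header_path` is my own name, not a `LlamaIndex` function). It assumes the `Header_<level>` key format shown in the `FlatReader` output above; other versions of the package may name these keys differently.

```python
from typing import Dict


def header_path(metadata: Dict[str, str], sep: str = " > ") -> str:
    """Join Header_<level> metadata values, shallowest first, into a breadcrumb."""
    headers = []
    for key, value in metadata.items():
        prefix, _, level = key.partition("_")
        # Only keys of the assumed form Header_<number> contribute to the path
        if prefix == "Header" and level.isdigit():
            headers.append((int(level), value))
    return sep.join(value for _, value in sorted(headers))


# Metadata copied from the node output above, in deliberately shuffled order
example = {
    "filename": "summary.md",
    "extension": ".md",
    "Header_4": "Required Input Data for Forecasting",
    "Header_3": "Key Features",
}
path = header_path(example)
print(path)
```

In the notebook this could be called as `header_path(node.metadata)` for each node in `md_nodes`.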


In [41]:
md_nodes = MarkdownNodeParser().get_nodes_from_documents(flat_reader_docs)