-------------------------------
#### Understanding the simple text splitters
------------------------------

- **Text splitters** break down the document into **smaller pieces** at the raw text level.
- They are useful when the content has a **flat structure** and does not come in a specific format.
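
To make "splitting at the raw text level" concrete, here is a minimal sketch in plain Python. It is not how LlamaIndex's own splitters work (those respect sentence or token boundaries); `split_text`, `chunk_size`, and `overlap` are hypothetical names chosen for illustration, and the splitter simply cuts fixed-size character windows that overlap slightly.

```python
def split_text(text: str, chunk_size: int = 100, overlap: int = 20) -> list[str]:
    """Cut raw text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in the range [0, chunk_size)")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reached the end of the text
    return chunks


sample = "In ancient Rome, the city of Rome itself was the heart of the vast Roman Empire."
chunks = split_text(sample, chunk_size=40, overlap=10)
for i, chunk in enumerate(chunks):
    print(i, repr(chunk))
```

The overlap keeps a little shared context between neighbouring chunks, which is the usual reason splitters expose such a parameter.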

#### SimpleFileNodeParser

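- **SimpleFileNodeParser** automatically selects the appropriate node parser for each document based on its file type, which it reads from the `extension` entry in the document's metadata.
- This lets a single parser turn a mixed set of files into nodes, without choosing a parser per file.
- At the time of writing, the file formats it routes to a dedicated parser are:
  - **Markdown** (`.md`), handled by `MarkdownNodeParser`
  - **HTML** (`.html`), handled by `HTMLNodeParser`
  - **JSON** (`.json`), handled by `JSONNodeParser`
- Files with other extensions, such as **plain text files**, receive no format-specific parsing.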
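
The selection step described above can be pictured as a lookup from file extension to parser. The sketch below is plain Python, not LlamaIndex code: `PARSERS_BY_EXTENSION`, `split_markdown`, `split_json`, and `parse_by_extension` are hypothetical names, and the two handlers are deliberately crude stand-ins for real format-aware parsers.

```python
import json


def split_markdown(text: str) -> list[str]:
    """Hypothetical handler: start a new piece at every heading line."""
    pieces, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:
            pieces.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        pieces.append("\n".join(current))
    return pieces


def split_json(text: str) -> list[str]:
    """Hypothetical handler: one piece per top-level key of a JSON object."""
    data = json.loads(text)
    if not isinstance(data, dict):
        return [text]
    return [f"{key}: {json.dumps(value)}" for key, value in data.items()]


# Assumed mapping for illustration; a real library defines its own.
PARSERS_BY_EXTENSION = {".md": split_markdown, ".json": split_json}


def parse_by_extension(text: str, metadata: dict) -> list[str]:
    """Pick a handler from the 'extension' metadata entry, else pass through."""
    handler = PARSERS_BY_EXTENSION.get(metadata.get("extension", "").lower())
    if handler is None:
        return [text]  # unknown file type: keep the text as a single piece
    return handler(text)


print(parse_by_extension("# A\ntext\n# B\nmore", {"extension": ".md"}))
print(parse_by_extension("plain text", {"extension": ".txt"}))
```

The fallback branch is the design point worth noticing: a dispatcher of this kind needs some defined behaviour for file types it does not recognise.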

In [1]:
from llama_index.readers.file import FlatReader

In [2]:
from pathlib import Path

In [3]:
reader = FlatReader()

In [4]:
document = reader.load_data(Path("files/sample_document1.txt"))

In [5]:
document

[Document(id_='2e565669-1dee-4fc5-8cb8-f32fcc1c078b', embedding=None, metadata={'filename': 'sample_document1.txt', 'extension': '.txt'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text="In ancient Rome, the city of Rome itself was the heart of the vast Roman Empire. It was known for its grand architecture, including iconic structures like the Colosseum and the Pantheon. The Romans were skilled engineers and builders, creating an extensive network of roads, aqueducts, and bridges that connected their far-reaching territories. The Roman Republic, with its Senate and elected officials, gave rise to the famous Roman legions, which conquered vast lands and brought them under Roman rule. The Roman civilization's influence on art, law, and governance can still be seen in modern societies today.", mimetype='text/plain', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\n\n{content}', metadata_template='{key}: {value}', metadata_sepera

In [6]:
print(f"Metadata: {document[0].metadata}")
print(f"Text: {document[0].text}")

Metadata: {'filename': 'sample_document1.txt', 'extension': '.txt'}
Text: In ancient Rome, the city of Rome itself was the heart of the vast Roman Empire. It was known for its grand architecture, including iconic structures like the Colosseum and the Pantheon. The Romans were skilled engineers and builders, creating an extensive network of roads, aqueducts, and bridges that connected their far-reaching territories. The Roman Republic, with its Senate and elected officials, gave rise to the famous Roman legions, which conquered vast lands and brought them under Roman rule. The Roman civilization's influence on art, law, and governance can still be seen in modern societies today.


#### JSONNodeParser

This parser specializes in processing structured data in JSON format. Like the Markdown parser, it is created with `from_defaults()` and then applied to the loaded documents:

In [7]:
from llama_index.core.node_parser import JSONNodeParser
from llama_index.readers.file import FlatReader
from pathlib import Path

In [8]:
reader = FlatReader()
document = reader.load_data(Path("files/others/sample.json"))

In [9]:
json_parser = JSONNodeParser.from_defaults()
nodes = json_parser.get_nodes_from_documents(document)

In [10]:
for node in nodes:
    print(f"Metadata {node.metadata} \nText: {node.text}")

Metadata {'filename': 'sample.json', 'extension': '.json'} 
Text: quiz sport q1 question Which one is correct team name in NBA?
quiz sport q1 options New York Bulls
quiz sport q1 options Los Angeles Kings
quiz sport q1 options Golden State Warriros
quiz sport q1 options Huston Rocket
quiz sport q1 answer Huston Rocket
quiz maths q1 question 5 + 7 = ?
quiz maths q1 options 10
quiz maths q1 options 11
quiz maths q1 options 12
quiz maths q1 options 13
quiz maths q1 answer 12
quiz maths q2 question 12 - 8 = ?
quiz maths q2 options 1
quiz maths q2 options 2
quiz maths q2 options 3
quiz maths q2 options 4
quiz maths q2 answer 4
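
The output above suggests that every leaf value ends up on its own line, prefixed by the chain of keys that leads to it, and that list items share their parent's key path. A minimal sketch of that flattening in plain Python follows. It is not the library's implementation; `flatten_json` is a hypothetical helper, and the sample JSON is an assumption reconstructed from part of the printed output, since the contents of `sample.json` are not shown.

```python
import json


def flatten_json(value, path=()):
    """Yield one 'key path + value' line per leaf, walking depth first."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from flatten_json(child, path + (str(key),))
    elif isinstance(value, list):
        for item in value:
            # List items carry no index; they reuse the parent's key path.
            yield from flatten_json(item, path)
    else:
        yield " ".join(path + (str(value),))


# Assumed input, reconstructed from the output shown above.
sample = json.loads("""
{"quiz": {"maths": {"q1": {"question": "5 + 7 = ?",
                           "options": ["10", "11", "12", "13"],
                           "answer": "12"}}}}
""")

lines = list(flatten_json(sample))
print("\n".join(lines))
```

Keeping the full key path on every line means each line still says where its value came from, even after the nesting has been removed.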
