# Reader and Chunker

LionAGI offers a unified interface for loading and chunking documents that supports:

- reading and chunking plain text files
- `langchain` document loaders and text splitters
- `llamaindex` readers and node parsers
- self-defined readers and chunkers

In [1]:
from lionagi import ReaderType, ChunkerType
print('Reader Types:', list(ReaderType))
print('Chunker Types:', list(ChunkerType))

Reader Types: [<ReaderType.PLAIN: 'plain'>, <ReaderType.LANGCHAIN: 'langchain'>, <ReaderType.LLAMAINDEX: 'llama_index'>, <ReaderType.SELFDEFINED: 'self_defined'>]
Chunker Types: [<ChunkerType.PLAIN: 'plain'>, <ChunkerType.LANGCHAIN: 'langchain'>, <ChunkerType.LLAMAINDEX: 'llama_index'>, <ChunkerType.SELFDEFINED: 'self_defined'>]
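The printed members show that each type carries a string value, which is why later cells can pass `reader_type='langchain'` instead of the enum member. The sketch below is a minimal illustration, assuming the types behave like a standard `enum.Enum`; `ReaderTypeSketch` is a hypothetical stand-in that mirrors the printed values, not LionAGI's own class.

```python
from enum import Enum


class ReaderTypeSketch(Enum):
    """Hypothetical stand-in mirroring the values printed above."""
    PLAIN = "plain"
    LANGCHAIN = "langchain"
    LLAMAINDEX = "llama_index"
    SELFDEFINED = "self_defined"


# Looking a member up by value turns a plain string into the enum member.
member = ReaderTypeSketch("llama_index")
values = [m.value for m in ReaderTypeSketch]

print(member is ReaderTypeSketch.LLAMAINDEX)
print(values)
```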


In [2]:
file = "txt_examples/essay.txt"
file_path = "txt_examples"

with open(file) as f:
    essay = f.read()

## Readers

In [3]:
from lionagi import load

```python
load(reader: Union[str, Callable],
     reader_type=ReaderType.PLAIN,
     reader_args=[],
     reader_kwargs={},
     load_args=[],
     load_kwargs={},
     to_datanode: Union[bool, Callable] = True)
```
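The way these parameters fit together can be sketched as follows. This is not LionAGI's implementation: `load_sketch`, `NodeSketch` and `UpperReader` are hypothetical names, the default conversion is an assumption, and the sketch always calls `load()` on the reader. It only illustrates how the reader arguments, the load arguments and `to_datanode` might be forwarded.

```python
from dataclasses import dataclass, field


@dataclass
class NodeSketch:
    """Hypothetical stand-in for lionagi.DataNode."""
    content: str
    metadata: dict = field(default_factory=dict)


def load_sketch(reader, reader_args=(), reader_kwargs=None,
                load_args=(), load_kwargs=None, to_datanode=True):
    """Build the reader, call its load(), then optionally convert the result."""
    loader = reader(*reader_args, **(reader_kwargs or {}))
    docs = loader.load(*load_args, **(load_kwargs or {}))
    if to_datanode is False:
        return docs                      # keep the reader's native output
    if callable(to_datanode):
        return to_datanode(docs)         # user-supplied conversion
    return [NodeSketch(content=str(d)) for d in docs]  # assumed default


class UpperReader:
    """Toy reader: 'loads' an in-memory list of strings, upper-cased."""

    def __init__(self, lines, prefix=""):
        self.lines, self.prefix = lines, prefix

    def load(self):
        return [self.prefix + line.upper() for line in self.lines]


raw = load_sketch(UpperReader, reader_args=[["a", "b"]],
                  reader_kwargs={"prefix": "> "}, to_datanode=False)
nodes = load_sketch(UpperReader, reader_args=[["a", "b"]])
print(raw)
print(nodes[0].content)
```

The sketch uses `None` and empty tuples as defaults rather than mutable `[]` and `{}`, which avoids sharing state between calls.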

### ReaderType.PLAIN

Currently, `text_reader` is the only reader supported under `PLAIN`.

In [4]:
'''
text_reader(reader_args, reader_kwargs)
'''


In [5]:
datanodes1 = load('text_reader',
                  reader_args=[file_path, '.txt'],
                  reader_kwargs={'recursive': False, 'flatten': True, 'clean_text': True})

# The reader_kwargs above are the default settings.
# To keep the defaults, you can simply call:
# datanodes1 = load('text_reader', reader_args=[file_path, '.txt'])
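As a rough illustration of what a plain-text directory reader does with a folder, an extension and a `recursive` flag, here is a self-contained sketch. `read_text_dir` is a hypothetical function, not LionAGI's `text_reader`. The meaning given to `clean_text` (collapsing whitespace) is an assumption, and `flatten` is left out because its behavior is not described here.

```python
import tempfile
from pathlib import Path


def read_text_dir(directory, extension=".txt", recursive=False, clean_text=True):
    """Hypothetical plain-text reader: one dict per matching file."""
    root = Path(directory)
    pattern = f"*{extension}"
    paths = root.rglob(pattern) if recursive else root.glob(pattern)
    records = []
    for path in sorted(paths):
        text = path.read_text(encoding="utf-8")
        if clean_text:
            # Assumed meaning of clean_text: collapse runs of whitespace.
            text = " ".join(text.split())
        records.append({"content": text, "file_name": path.name})
    return records


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "a.txt").write_text("hello\n\n  world", encoding="utf-8")
    Path(tmp, "notes.md").write_text("ignored", encoding="utf-8")
    Path(tmp, "sub").mkdir()
    Path(tmp, "sub", "b.txt").write_text("nested", encoding="utf-8")

    flat = read_text_dir(tmp)
    deep = read_text_dir(tmp, recursive=True)

print([r["file_name"] for r in flat])
print([r["file_name"] for r in deep])
```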

### ReaderType.LANGCHAIN

In [6]:
'''
Equivalent to:

langchain_loader = reader(*reader_args, **reader_kwargs)
docs = langchain_loader.load()
'''


In [7]:
datanodes2 = load('TextLoader', 
                  reader_type='langchain', 
                  reader_args=[file])

# Equivalent to passing the loader class itself:

from langchain.document_loaders import TextLoader
datanodes2 = load(TextLoader,
                  reader_type='langchain',
                  reader_args=[file])

To keep the LangChain `Document` type instead of converting to `DataNode`, pass `to_datanode=False`.

### ReaderType.LLAMAINDEX

In [8]:
'''
Equivalent to:

loader = reader(*reader_args, **reader_kwargs)
docs = loader.load_data(*load_args, **load_kwargs)
'''


In [9]:
datanodes3 = load('SimpleDirectoryReader', 
                  reader_type='llama_index', 
                  reader_args=[file_path])

# equivalent to

from llama_index import SimpleDirectoryReader
datanodes3 = load(SimpleDirectoryReader, 
                  reader_type='llama_index', 
                  reader_args=[file_path])

To keep the native LlamaIndex objects instead of converting them to `DataNode`, pass `to_datanode=False`.

### ReaderType.SELFDEFINED

Self-defined reader classes that provide a `load()` method are also supported. For example:

In [10]:
class demo_reader:
    def __init__(self, file):
        self.file = file
    
    def load(self):
        with open(self.file) as f:
            return f.read()

In [11]:
datanodes4 = load(demo_reader, 
                  reader_type='self_defined', 
                  reader_args=[file], 
                  to_datanode=False)

To convert the output to `DataNode` objects, pass an explicitly defined parsing function as `to_datanode`.

In [12]:
from lionagi import DataNode
def to_datanode(content):
    return DataNode(content=content)

In [13]:
datanodes5 = load(demo_reader, 
                  reader_type='self_defined', 
                  reader_args=[file], 
                  to_datanode=to_datanode)

## Chunker

In [14]:
from lionagi import chunk, datanodes_convert

```python
chunk(documents,
      chunker,
      chunker_type=ChunkerType.PLAIN,
      chunker_args=[],
      chunker_kwargs={},
      chunking_kwargs={},
      documents_convert_func=None,
      to_datanode: Union[bool, Callable] = True)
```

The document type must be supported by the selected chunker. If it is not, pass a conversion function as `documents_convert_func`.
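A conversion function of this kind might look like the sketch below. It assumes the function is called with the documents and a target name (`'langchain'` or `'llama_index'`). `convert_sketch` and `NodeSketch` are hypothetical, and plain dicts stand in for the framework objects so the example runs without either library installed. The field names follow the frameworks' conventions: `page_content` for LangChain documents, `text` for LlamaIndex text nodes.

```python
from dataclasses import dataclass, field


@dataclass
class NodeSketch:
    """Hypothetical stand-in for lionagi.DataNode."""
    content: str
    metadata: dict = field(default_factory=dict)


def convert_sketch(documents, target):
    """Map nodes to the plain shape a given chunker family might expect."""
    if target == "langchain":
        # LangChain documents carry `page_content` and `metadata`.
        return [{"page_content": d.content, "metadata": d.metadata}
                for d in documents]
    if target == "llama_index":
        # LlamaIndex text nodes carry `text` and `metadata`.
        return [{"text": d.content, "metadata": d.metadata} for d in documents]
    raise ValueError(f"unsupported target: {target!r}")


nodes = [NodeSketch("alpha", {"source": "essay.txt"})]
as_langchain = convert_sketch(nodes, "langchain")
as_llama = convert_sketch(nodes, "llama_index")
print(as_langchain)
print(as_llama)
```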

In [15]:
'''
format:
# for langchain
documents = documents_convert_func(documents, 'langchain')

# for llama_index
documents = documents_convert_func(documents, 'llama_index')

datanodes_convert: converts List[DataNode] to List[Document] (langchain) or List[TextNode] (llama_index)
from lionagi.loaders import datanodes_convert
'''

"\nformat:\n# for langchain\ndocuments = documents_convert_func(documents, 'langchain')\n\n# for llama_index\ndocuments = documents_convert_func(documents, 'llama_index')\n\ndatanodes_convert: avaliable to convert from List[DataNode] to List[Document] (langchain) or List[TextNode] (llama_index)\nfrom lionagi.loaders import datanodes_convert\n"

### ChunkerType.PLAIN

Currently, `text_chunker` is the only chunker supported under `PLAIN`.

In [16]:
'''
text_chunker(documents, chunker_args, chunker_kwargs)
'''


In [17]:
datanodes6 = chunk(datanodes1,
                   chunker='text_chunker',
                   chunker_kwargs={'field': 'content', 'chunk_size': 1500, 'overlap': 0.1, 'threshold': 200})

# The chunker_kwargs above are the default settings.
# To keep the defaults, you can simply call:
# datanodes6 = chunk(datanodes1, chunker='text_chunker')
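To give a feel for how `chunk_size`, `overlap` and `threshold` could interact, here is a small character-based sketch. The semantics are assumptions, not taken from LionAGI's `text_chunker`: `overlap` is treated as the fraction of `chunk_size` shared between neighboring chunks, and `threshold` as the shortest tail allowed to form its own chunk (anything shorter is absorbed into the last chunk). `chunk_text_sketch` is a hypothetical name.

```python
def chunk_text_sketch(text, chunk_size=1500, overlap=0.1, threshold=200):
    """Split text into overlapping windows; a short tail joins the last chunk."""
    if chunk_size <= 0 or not 0 <= overlap < 1:
        raise ValueError("chunk_size must be > 0 and overlap in [0, 1)")
    step = chunk_size - int(chunk_size * overlap)  # distance between chunk starts
    chunks, start = [], 0
    while start < len(text):
        end = start + chunk_size
        if len(text) - end <= threshold:
            # What remains after this window is too short to stand alone,
            # so this chunk runs to the end of the text.
            chunks.append(text[start:])
            break
        chunks.append(text[start:end])
        start += step
    return chunks


text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text_sketch(text, chunk_size=300, overlap=0.1, threshold=50)
print([len(c) for c in chunks])  # [300, 300, 300, 190]
```

With these numbers each chunk starts 270 characters after the previous one, so the last 30 characters of one chunk reappear at the start of the next.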

### ChunkerType.LANGCHAIN

In [18]:
'''
Equivalent to:

## if documents are strings
splitter = chunker(*chunker_args, **chunker_kwargs)
splitter.split_text(documents) 

## if documents are a list of langchain Documents
splitter = chunker(*chunker_args, **chunker_kwargs)
splitter.split_documents(documents)
'''

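The choice between the two splitter methods described above can be sketched without LangChain installed. `ToySplitter` and `split_any` are hypothetical names; the toy class only borrows the method names `split_text` and `split_documents`, and plain dicts stand in for LangChain documents.

```python
class ToySplitter:
    """Hypothetical splitter exposing LangChain-style method names."""

    def __init__(self, chunk_size):
        self.chunk_size = chunk_size

    def split_text(self, text):
        size = self.chunk_size
        return [text[i:i + size] for i in range(0, len(text), size)]

    def split_documents(self, documents):
        # Here a "document" is a plain dict with a page_content key.
        return [{"page_content": piece, "metadata": doc["metadata"]}
                for doc in documents
                for piece in self.split_text(doc["page_content"])]


def split_any(splitter, documents):
    """Pick the method from the input type: str -> split_text, else split_documents."""
    if isinstance(documents, str):
        return splitter.split_text(documents)
    return splitter.split_documents(documents)


splitter = ToySplitter(chunk_size=4)
from_text = split_any(splitter, "abcdefghij")
from_docs = split_any(splitter, [{"page_content": "abcdef", "metadata": {"id": 1}}])
print(from_text)
print(from_docs)
```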

In [None]:
datanodes7 = chunk(essay, 
                   'CharacterTextSplitter', 
                   chunker_type='langchain', 
                   chunker_kwargs={'chunk_size':200, 'chunk_overlap': 20, 'length_function': len})

# equivalent to

from langchain.text_splitter import CharacterTextSplitter
datanodes7 = chunk(essay, 
                   CharacterTextSplitter, 
                   chunker_type='langchain', 
                   chunker_kwargs={'chunk_size':200, 'chunk_overlap': 20, 'length_function': len})


To keep the LangChain `Document` type instead of converting to `DataNode`, pass `to_datanode=False`.

### ChunkerType.LLAMAINDEX

In [20]:
'''
Equivalent to:

node_parser = chunker(*chunker_args, **chunker_kwargs)
nodes = node_parser.get_nodes_from_documents(documents, **chunking_kwargs)

or

node_parser = chunker.from_defaults(*chunker_args, **chunker_kwargs)
nodes = node_parser.get_nodes_from_documents(documents, **chunking_kwargs)
'''


In [21]:
tnodes = load('SimpleDirectoryReader',
              reader_type='llama_index',
              reader_args=[file_path],
              to_datanode=False)
datanodes8 = chunk(tnodes, 
                   'SentenceSplitter', 
                   chunker_type='llama_index')

# equivalent to

from llama_index.text_splitter import SentenceSplitter
datanodes8 = chunk(tnodes, 
                   SentenceSplitter, 
                   chunker_type='llama_index')

To keep the LlamaIndex node type instead of converting to `DataNode`, pass `to_datanode=False`.

### ChunkerType.SELFDEFINED

Self-defined chunker classes that provide a `split()` method are also supported. For example:

In [22]:
class demo_chunker:
    def __init__(self, separator):
        self.separator = separator

    def split(self, content):
        # Split the text on the separator and return the pieces as a list.
        return content.split(self.separator)

In [23]:
chunks = chunk(essay, 
               demo_chunker, 
               chunker_type = 'self_defined', 
               chunker_args=['.'], 
               to_datanode=False)

To convert the output to `DataNode` objects, pass an explicitly defined parsing function as `to_datanode`.

In [24]:
from lionagi import DataNode
def to_datanode(content):
    # Wrap each chunk in its own DataNode.
    return [DataNode(content=c) for c in content]

In [25]:
datanodes9 = chunk(essay,
                   demo_chunker, 
                   chunker_type = 'self_defined', 
                   chunker_args=['.'], 
                   to_datanode=to_datanode)
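Splitting on `'.'` with `str.split` leaves leading whitespace on each piece and an empty string after a final period. A self-defined chunker can tidy this up inside `split()`. The class below is a hypothetical variant of `demo_chunker`, shown on its own so it runs without LionAGI.

```python
class SentenceChunkerSketch:
    """Hypothetical chunker exposing the same split() method as demo_chunker."""

    def __init__(self, separator="."):
        self.separator = separator

    def split(self, content):
        pieces = content.split(self.separator)
        # Trim surrounding whitespace and drop pieces that end up empty.
        return [p.strip() for p in pieces if p.strip()]


chunker = SentenceChunkerSketch(".")
sentences = chunker.split("First sentence. Second one.  Third.")
print(sentences)  # ['First sentence', 'Second one', 'Third']
```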