## Chunking

### Basic Syntax without external modules

This snippet processes the files in a directory named `data`. It iterates over each file, reads its content as text, and slices that content into chunks of at most 512 characters (the files are opened in text mode, so the unit is characters, not bytes). Each chunk is passed to `process`, a placeholder for whatever analysis or transformation you need. Only Python's built-in `os` module is used to list the directory and build file paths.

In [None]:
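import os

data_directory = 'data'
chunk_size = 512  # characters per chunk (files are opened in text mode)

def process(chunk):
    # Placeholder: replace with your own analysis or transformation
    return chunk

for filename in os.listdir(data_directory):
    filepath = os.path.join(data_directory, filename)
    if not os.path.isfile(filepath):
        continue  # skip subdirectories

    # Load data from file
    with open(filepath, 'r') as f:
        data = f.read()

    # Chunk data into smaller parts
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i+chunk_size]

        # Process chunk (e.g. perform some analysis or transformation)
        processed_chunk = process(chunk)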
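
The slicing step above can also be pulled out into a small generator, which makes it easy to check on its own. This is a minimal sketch; the name `iter_chunks` is hypothetical and not part of the original notebook.

```python
def iter_chunks(text, chunk_size=512):
    """Yield consecutive slices of `text`, each at most `chunk_size` characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    for start in range(0, len(text), chunk_size):
        yield text[start:start + chunk_size]


sample = "a" * 1200
chunks = list(iter_chunks(sample))
print([len(c) for c in chunks])  # [512, 512, 176]
```

Only the last chunk can be shorter than `chunk_size`, and joining the chunks reproduces the input exactly.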

### Adding Recursive Text Splitting

This version still processes the files in `data`, but replaces fixed-size slicing with a recursive function, `tokenize`. Text that already fits within `max_token_size` is returned as is. Otherwise `tokenize` splits on the first character in `split_chars` that occurs in the text (for example newlines, periods, semicolons, commas, or spaces) and recurses into each piece; if no delimiter is present, it halves the text. Splitting at delimiters rather than at arbitrary offsets keeps chunks from being cut mid-word, but the pieces are not merged back and the delimiter itself is dropped, so chunks can be much shorter than the limit. The script then passes each resulting token to `process`.

In [None]:
import os

data_directory = 'data'

def tokenize(text, max_token_size=512, split_chars=None):
    """Recursively split `text` until every piece is at most `max_token_size` characters."""
    if len(text) <= max_token_size:
        return [text]

    # Split on the first listed character that occurs in the text
    for char in split_chars or []:
        if char in text:
            tokens = []
            for part in text.split(char):
                if part:  # skip empty pieces left by consecutive delimiters
                    tokens.extend(tokenize(part, max_token_size, split_chars))
            return tokens

    # No delimiter found: fall back to halving the text
    mid_point = len(text) // 2
    left_tokens = tokenize(text[:mid_point], max_token_size, split_chars)
    right_tokens = tokenize(text[mid_point:], max_token_size, split_chars)

    return left_tokens + right_tokens

for filename in os.listdir(data_directory):
    filepath = os.path.join(data_directory, filename)

    # Load data from file
    with open(filepath, 'r') as f:
        data = f.read()

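    # Characters to split on, tried in order (coarsest first, so that long
    # text is broken at line and sentence boundaries before single spaces)
    split_chars = ['\n', '.', ';', ',', ' ']  # Example characters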

    # Tokenize data into smaller parts
    tokens = tokenize(data, split_chars=split_chars)
    
    for token in tokens:
        # Process each token (e.g. perform some analysis or transformation)
        processed_token = process(token)
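
Because `tokenize` never merges pieces back together, long texts tend to come out as many short fragments. One possible refinement is sketched below: split on the coarsest separator available, keep the separator attached to the piece it ends, and glue neighbouring pieces together greedily up to the size limit. The function name `split_with_merge`, its parameters, and the default separators are assumptions made for illustration, not part of the original notebook.

```python
def split_with_merge(text, max_size=512, separators=("\n", ". ", " ")):
    """Split `text` into chunks of at most `max_size` characters.

    Separators are tried coarsest first; pieces are glued back together
    greedily so chunks stay close to `max_size`.
    """
    if len(text) <= max_size:
        return [text] if text else []

    for index, sep in enumerate(separators):
        if sep not in text:
            continue
        # Keep the separator attached to the piece it ends
        parts = text.split(sep)
        pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
        chunks, current = [], ""
        for piece in pieces:
            if not piece:
                continue
            if len(current) + len(piece) <= max_size:
                current += piece  # still fits: keep growing the chunk
                continue
            if current:
                chunks.append(current)
                current = ""
            if len(piece) <= max_size:
                current = piece
            else:
                # Piece is still too long: recurse with the finer separators
                chunks.extend(
                    split_with_merge(piece, max_size, separators[index + 1:])
                )
        if current:
            chunks.append(current)
        return chunks

    # No separator left: fall back to fixed-size slices
    return [text[i:i + max_size] for i in range(0, len(text), max_size)]


text = "First sentence. Second sentence is here. Third one.\nA new line follows."
for chunk in split_with_merge(text, max_size=30):
    print(repr(chunk))
```

In this sketch no characters are dropped, so joining the chunks gives back the original text, and every chunk respects `max_size`.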

### Using LangChain for simplicity

This snippet uses the LangChain library to simplify chunking. `DirectoryLoader` loads the documents found in the `data` directory, and `RecursiveCharacterTextSplitter` splits them into chunks of at most 512 characters with an overlap of 20% (102 characters) to preserve continuity between chunks. The splitter tries the separators in order (paragraph breaks, line breaks, spaces, and finally individual characters), so it breaks text at the coarsest boundary that still fits the chunk size. This removes most of the hand-written splitting logic and tends to produce more coherent chunks.

In [None]:
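# Newer LangChain releases expose these classes from
# langchain_community.document_loaders and langchain_text_splitters instead
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter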

# Load every document found in the 'data' directory (relative to the working directory)
loader = DirectoryLoader('data')
documents = loader.load()

CHUNK_SIZE = 512
CHUNK_OVERLAP = int(CHUNK_SIZE * 0.2)  # 102 characters

text_splitter = RecursiveCharacterTextSplitter(
    separators=["\n\n", "\n", " ", ""],
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

# Split the loaded documents with the splitter configured above
chunks = text_splitter.split_documents(documents)
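
To make the overlap arithmetic concrete, here is a plain-Python sketch of fixed-size windows that share 102 characters with their predecessor. It only illustrates what a 20% overlap means; it is not how `RecursiveCharacterTextSplitter` works internally (that splitter merges separator-delimited pieces), and the helper name `sliding_chunks` is hypothetical.

```python
CHUNK_SIZE = 512
CHUNK_OVERLAP = int(CHUNK_SIZE * 0.2)  # int(102.4) == 102


def sliding_chunks(text, chunk_size=CHUNK_SIZE, overlap=CHUNK_OVERLAP):
    """Return windows of `chunk_size` characters, each repeating the last
    `overlap` characters of the previous window."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    if not text:
        return []
    step = chunk_size - overlap  # 410 with the values above
    # Stop early enough that the final window always adds new characters
    last_start = max(len(text) - overlap, 1)
    return [text[start:start + chunk_size] for start in range(0, last_start, step)]


text = "".join(chr(97 + i % 26) for i in range(1000))
windows = sliding_chunks(text)
print([len(w) for w in windows])  # [512, 512, 180]
```

Each window starts 410 characters after the previous one, so the last 102 characters of one window are the first 102 of the next.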