## Chunking

### Basic Syntax without external modules
This snippet chunks a list of strings using only built-in Python, with no external modules. `tokenize_documents` splits each string on spaces and appends words to the current chunk until a budget of `max_tokens_per_chunk` (50 by default) is reached, then starts a new chunk. Note that the budget check adds the number of words already in the chunk to the character length of the next word, so the limit is a rough heuristic rather than a strict word or character count. The resulting list of chunks is passed to a placeholder function, `process`, which simply prints it; in practice it could perform any analysis or transformation on the data.

In [None]:
def process(chunk):
    print(chunk)

def tokenize_documents(strings, max_tokens_per_chunk=50):
    """
        Tokenizes a list of strings into individual words and groups them into chunks based on max token limit for each string.
    """
    all_chunks = []

    for text in strings:
        tokens = text.split(' ')
        current_chunk = []
        chunks = []

        for token in tokens:
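            # Budget check: words already in the chunk + characters in this word + 1.
            # This mixes a word count with a character count, so treat the limit as a heuristic.
            if len(current_chunk) + len(token) + 1 <= max_tokens_per_chunk: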
                current_chunk.append(token)
            else:
                chunks.append(' '.join(current_chunk))
                current_chunk = [token]

        if current_chunk:
            chunks.append(' '.join(current_chunk))

        all_chunks.extend(chunks)

    return all_chunks


# List of strings to process
string_list = [
    "This is a sample string that has been expanded to become much longer. It includes various words and phrases designed to make up the required word count. The goal here is to generate content that is coherent yet lengthy, showcasing the ability to produce substantial amounts of text on demand.",
    "Another example for testing purposes, this string aims to achieve similar length as the previous one. It features a diverse range of vocabulary and sentence structures to ensure it meets the 1000-word requirement while maintaining readability and relevance."
]

# Tokenize data into smaller parts
chunks = tokenize_documents(string_list)

process(chunks)

['This is a sample string that has been expanded to become much longer. It includes various words and phrases designed to make up the required word count. The goal here is to generate content that is coherent yet lengthy, showcasing the ability to', 'produce substantial amounts of text on demand.', 'Another example for testing purposes, this string aims to achieve similar length as the previous one. It features a diverse range of vocabulary and sentence structures to ensure it meets the 1000-word requirement while maintaining readability and relevance.']
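
Because the budget check in `tokenize_documents` combines a word count with a character length, the chunk boundaries in the output above are hard to predict. As a point of comparison, here is a minimal sketch of a strict word-count chunker. The names `chunk_by_word_count` and `max_words` are hypothetical and not part of the original notebook, and the short sample sentence is only for illustration.

```python
def chunk_by_word_count(text, max_words=20):
    """Group whitespace-separated words into chunks of at most max_words words."""
    words = text.split()  # split() with no argument also collapses repeated whitespace
    return [
        ' '.join(words[i:i + max_words])
        for i in range(0, len(words), max_words)
    ]


sample = "The goal here is to generate content that is coherent yet lengthy"
for chunk in chunk_by_word_count(sample, max_words=5):
    print(chunk)
```

With a limit expressed purely in words, every chunk except possibly the last holds exactly `max_words` words, which makes the behavior easier to reason about than the mixed check.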


### Adding Recursive Text Splitting
This snippet keeps the same in-memory list of strings but replaces the single split on spaces with a recursive one. The inner helper `recursive_split` splits on the first delimiter, then splits each resulting piece on the next, and so on through spaces, periods, commas, semicolons, and newlines, discarding empty pieces. Because `str.split` drops the delimiter, punctuation does not survive into the chunks, as the output below shows. `create_chunks` then groups the tokens into lists whose combined character length (spaces excluded) stays within `max_tokens_per_chunk`, and `process` prints each chunk with its tokens rejoined by spaces.

In [None]:
def process(chunk):
    print(' '.join(chunk))

def tokenize_documents(strings, delimiters, max_tokens_per_chunk=50):
    """
        Tokenizes a list of strings into smaller chunks based on specified delimiters and maximum token limit per chunk.
        Uses recursive splitting to handle multiple delimiters and ensures each chunk maintains linguistic coherence.
    """
    
    def recursive_split(s, delims):
        if not delims:
            return [s]
        delim = delims[0]
        parts = s.split(delim)
        result = []
        for part in parts:
            if part:  # skip empty strings produced by adjacent delimiters
                result.extend(recursive_split(part, delims[1:]))
        return result
    
    def create_chunks(tokens, max_tokens):
        chunks = []
        current_chunk = []
        current_length = 0
        
        for token in tokens:
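            token_length = len(token)  # the budget is measured in characters, not tokens
            # Start a new chunk when this token would exceed the budget
            # (the current_chunk check avoids emitting an empty chunk).
            if current_chunk and current_length + token_length > max_tokens: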
                chunks.append(current_chunk)
                current_chunk = [token]
                current_length = token_length
            else:
                current_chunk.append(token)
                current_length += token_length
        
        if current_chunk:
            chunks.append(current_chunk)
        
        return chunks
    
    all_chunks = []
    for string in strings:
        tokens = recursive_split(string, delimiters)
        chunks = create_chunks(tokens, max_tokens_per_chunk)
        all_chunks.extend(chunks)
    
    return all_chunks

# List of strings to process
string_list = [
    "This is a sample string that has been expanded to become much longer. It includes various words and phrases designed to make up the required word count. The goal here is to generate content that is coherent yet lengthy, showcasing the ability to produce substantial amounts of text on demand.",
    "Another example for testing purposes, this string aims to achieve similar length as the previous one. It features a diverse range of vocabulary and sentence structures to ensure it meets the 1000-word requirement while maintaining readability and relevance."
]

# Define characters to split the text
split_chars = [' ', '.', ',', ';', '\n']  # Example characters
# Split the data into chunks (each chunk is a list of tokens)
chunks = tokenize_documents(string_list, delimiters=split_chars)

for chunk in chunks:
    # Process each chunk (e.g. perform some analysis or transformation)
    process(chunk)

This is a sample string that has been expanded to become much
longer It includes various words and phrases designed to
make up the required word count The goal here is to generate
content that is coherent yet lengthy showcasing the
ability to produce substantial amounts of text on demand
Another example for testing purposes this string aims to
achieve similar length as the previous one It features a
diverse range of vocabulary and sentence structures to
ensure it meets the 1000-word requirement while
maintaining readability and relevance
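
The output above loses its punctuation because every delimiter is applied to every piece. A common alternative is to try coarse separators first and recurse to finer ones only for pieces that are still too long. The following is a minimal sketch of that idea, not the notebook's method: `split_recursively`, `max_chars`, and the sample text are hypothetical, and the function only splits (it does not merge small pieces back into larger chunks).

```python
def split_recursively(text, separators, max_chars):
    """Split on the first separator; recurse with the remaining separators
    only for pieces that are still longer than max_chars."""
    text = text.strip()
    if not text:
        return []
    if len(text) <= max_chars or not separators:
        return [text]

    head, rest = separators[0], separators[1:]
    parts = text.split(head)
    pieces = []
    for i, part in enumerate(parts):
        if i < len(parts) - 1:
            # str.split drops the separator, so re-attach its punctuation
            part += head.rstrip()
        pieces.extend(split_recursively(part, rest, max_chars))
    return pieces


text = ("Chunking splits long text. "
        "Short pieces stay whole, while long ones are split again.")
for piece in split_recursively(text, ['. ', ', ', ' '], max_chars=40):
    print(piece)
```

Here the first sentence fits within the limit and is kept intact, while the second is split at its comma; the space separator is never needed for this input.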


### Using LangChain for simplicity
This snippet uses LangChain's `RecursiveCharacterTextSplitter` to do the chunking. Each string is wrapped in a `Document`, and `split_documents` breaks it into chunks of at most 50 characters with a 20% overlap (10 characters), so the end of one chunk is repeated at the start of the next to preserve continuity. The splitter tries the given separators (space, period, comma, semicolon, newline) in order, which removes the need for the hand-written splitting and merging logic from the previous sections.

In [18]:
!pip install langchain &> /dev/null

from langchain_core.documents import Document
from langchain.text_splitter import RecursiveCharacterTextSplitter

def process(chunk):
    print(chunk)

# List of strings to process
string_list = [
    "This is a sample string that has been expanded to become much longer. It includes various words and phrases designed to make up the required word count. The goal here is to generate content that is coherent yet lengthy, showcasing the ability to produce substantial amounts of text on demand.",
    "Another example for testing purposes, this string aims to achieve similar length as the previous one. It features a diverse range of vocabulary and sentence structures to ensure it meets the 1000-word requirement while maintaining readability and relevance."
]

CHUNK_SIZE = 50
CHUNK_OVERLAP = int(CHUNK_SIZE * 0.2)

text_splitter = RecursiveCharacterTextSplitter(
    separators=[' ', '.', ',', ';', '\n'],
    chunk_size=CHUNK_SIZE,
    chunk_overlap=CHUNK_OVERLAP
)

chunks = text_splitter.split_documents([Document(page_content=item) for item in string_list])

for chunk in chunks:
    process(chunk.page_content)


This is a sample string that has been expanded to
to become much longer. It includes various words
words and phrases designed to make up the
up the required word count. The goal here is to
is to generate content that is coherent yet
yet lengthy, showcasing the ability to produce
produce substantial amounts of text on demand.
Another example for testing purposes, this string
string aims to achieve similar length as the
as the previous one. It features a diverse range
range of vocabulary and sentence structures to
to ensure it meets the 1000-word requirement
while maintaining readability and relevance.
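
The repeated words at the start of each line above come from `chunk_overlap`. To make the idea concrete, here is a minimal pure-Python sketch of a sliding window with overlap. It is not LangChain's implementation: it counts words rather than characters, and `chunk_with_overlap` and its parameters are hypothetical names chosen for illustration.

```python
def chunk_with_overlap(words, chunk_size, overlap):
    """Slide a window of chunk_size words, stepping by chunk_size - overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the input
    return chunks


words = "one two three four five six seven".split()
for chunk in chunk_with_overlap(words, chunk_size=4, overlap=1):
    print(' '.join(chunk))
```

Each chunk starts `chunk_size - overlap` words after the previous one, so the last `overlap` words of a chunk reappear at the start of the next.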
