Chunking Strategies for Large Language Models

When working with large input documents, such as PDFs or .txt files, querying the index can yield subpar results. Several factors can be tuned to address this, one of which is the chunking (node creation) process within LlamaIndex.

* https://medium.com/@bavalpreetsinghh/llama-index-a-comprehensive-guide-for-building-and-querying-document-indexes-27a13bb482a5
* https://medium.com/@bavalpreetsinghh/llamaindex-chunking-strategies-for-large-language-models-part-1-ded1218cfd30

LlamaIndex addresses the challenge of scaling language models to large document collections with two key strategies. First, it chunks documents into smaller contexts, such as sentences or paragraphs, which are referred to as Nodes. These Nodes can be processed efficiently by language models. Second, it indexes these Nodes using vector embeddings, enabling fast semantic search.

By chunking documents and leveraging vector embeddings, LlamaIndex enables scalable semantic search over large datasets: it retrieves the relevant Nodes from the index and synthesizes a response from them using a language model.
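
The retrieval step can be pictured with a toy example. This is only a sketch under stated assumptions: the three-dimensional vectors are made up by hand (real embeddings come from an embedding model and have hundreds of dimensions or more), and `cosine_similarity` and `top_k` are illustrative helper names, not LlamaIndex functions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, node_vecs, k=2):
    """Return the ids of the k Nodes whose vectors are most similar to the query."""
    ranked = sorted(
        node_vecs.items(),
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [node_id for node_id, _ in ranked[:k]]


# Hand-made stand-ins for Node embeddings (assumption, for illustration only).
node_vecs = {
    "node-a": [1.0, 0.0, 0.0],
    "node-b": [0.7, 0.7, 0.0],
    "node-c": [0.0, 0.0, 1.0],
}
best = top_k([0.9, 0.1, 0.0], node_vecs, k=2)
print(best)
```

Only the Nodes returned by the similarity search are passed to the language model, which is what keeps the prompt within the context window.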

We cannot pass unlimited data to the application for two main reasons:

1. Context limit: Language models have limited context windows.
2. Signal-to-noise ratio: Language models are more effective when the information provided is relevant to the task.


Extracting Sections, Headings, Paragraphs, and Tables with Cutting-Edge Parser
* https://www.llamaindex.ai/blog/mastering-pdfs-extracting-sections-headings-paragraphs-and-tables-with-cutting-edge-parser-faea18870125

Chunking is the process of breaking down large pieces of text into smaller segments. It’s an essential technique that helps optimize the relevance of the content we get back from a database once we use the LLM to embed content. Two of the strategies involved are:

1. Fixed-size chunking. This is the most common and straightforward approach to chunking: we simply decide the number of tokens in each chunk and, optionally, whether there should be any overlap between chunks. It is easy to implement and widely used, but it rarely makes it to a production setting: the output looks satisfactory in a Proof of Concept (POC) setup, but accuracy degrades as we conduct further testing.

2. “Content-aware” chunking. A set of methods that take advantage of the nature of the content we’re chunking and apply more sophisticated chunking to it. These methods are harder to implement, but if tackled correctly they could be the ideal building block for a production-grade Information Retrieval (IR) engine.
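
A minimal sketch of fixed-size chunking with overlap, assuming whitespace-separated words stand in for model tokens. `fixed_size_chunks` and its parameters are illustrative names, not part of LlamaIndex.

```python
def fixed_size_chunks(text, chunk_size=8, overlap=2):
    """Split text into chunks of `chunk_size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    tokens = text.split()  # assumption: words approximate model tokens
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


sample = "one two three four five six seven eight nine ten"
chunks = fixed_size_chunks(sample, chunk_size=4, overlap=1)
for chunk in chunks:
    print(chunk)
```

Note that the boundaries fall wherever the token count dictates, regardless of sentence or topic boundaries, which is the weakness content-aware methods try to address.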





In [50]:
# https://github.com/run-llama/llama_index/tree/main/llama-index-integrations/readers/llama-index-readers-file
from llama_index.core import SimpleDirectoryReader
from llama_index.readers.file import PDFReader

# PDF Reader with `SimpleDirectoryReader`
parser = PDFReader()
# `file_extractor` maps file extensions (not file names) to readers;
# the PDF in "data" is AICompanionsReduceLoneliness.pdf
file_extractor = {".pdf": parser}
documents = SimpleDirectoryReader(
    "data", 
    file_extractor=file_extractor
).load_data()

In [29]:
len(documents)

82

In [64]:
import re

def extract_text(document):
    """Return the raw text of a LlamaIndex Document."""
    return document.text

def remove_citations(text):
    """Remove citations from the given text."""
    # Remove citations like (Author Year)
    text = re.sub(r'\([^()]*\d{4}[^()]*\)', '', text)
    
    # Remove citations like (e.g., Author and Author Year; Author et al. Year)
    text = re.sub(r'\(e\.g\.,\s[^()]*\d{4}[^()]*\)', '', text)
    
    # Remove any remaining citations like (Author et al.)
    text = re.sub(r'\([^()]*et al\.[^()]*\)', '', text)
    
    # Remove extra whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    
    return text

In [82]:

cleaned_text_all = []
for i in range(2, 75):  # documents 2..74 of the 82 loaded
    # Extract text
    extracted_text = extract_text(documents[i])

    # Remove citations
    cleaned_text = remove_citations(extracted_text)

    cleaned_text_all.append(cleaned_text)

# print("Original text:")
# print(extracted_text)

# print("\nText with citations removed:")
# print(cleaned_text)

In [84]:
#https://docs.llamaindex.ai/en/stable/module_guides/loading/documents_and_nodes/usage_documents/
from llama_index.core import Document
cleaned_documents = [Document(text=t) for t in cleaned_text_all]

In [90]:
print(cleaned_documents[72])

Doc ID: bb5abe72-6f60-418a-b4ff-030be6bafe2e
Text: human-human interaction or human-robot interaction scenarios
within the context of the con ﬁdant relationship, animal-assisted
therapy, increasing social forms of video gaming), (3) increasing
opportunities for social contact (face-to-face or online
meetings,social prescribing service, asset-based community
development),and (4) changing maladapt...


Text Splitters

* https://docs.llamaindex.ai/en/stable/module_guides/loading/node_parsers/modules/

SemanticSplitterNodeParser

* https://docs.llamaindex.ai/en/stable/examples/node_parsers/semantic_chunking/
* https://www.youtube.com/watch?v=8OJC21T2SL4&t=1933s



In [91]:
from llama_index.core.node_parser import (
    SentenceSplitter,
    SemanticSplitterNodeParser,
)
from llama_index.embeddings.openai import OpenAIEmbedding

import os
from dotenv import load_dotenv
load_dotenv()

OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")


Buffer size: For a given sentence, the buffer size defines the number of surrounding sentences to be added for embeddings creation. For example, a buffer size of 1 results in 3 sentences (current, previous and next sentence) to be combined and embedded. This parameter can influence how much text is examined together to determine the boundaries of each chunk, impacting the granularity and coherence of the resulting chunks. A larger buffer size might capture more context but can also introduce noise, while a smaller buffer size might miss important context but ensures more precise chunking.
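
The grouping described above can be sketched as follows. This mirrors the behaviour described in the paragraph rather than the library's internals, and `combine_with_buffer` is a hypothetical helper name.

```python
def combine_with_buffer(sentences, buffer_size=1):
    """For each sentence, join it with up to `buffer_size` neighbours on each side."""
    combined = []
    for i in range(len(sentences)):
        start = max(0, i - buffer_size)
        end = min(len(sentences), i + buffer_size + 1)
        combined.append(" ".join(sentences[start:end]))
    return combined


sentences = ["A.", "B.", "C.", "D."]
groups = combine_with_buffer(sentences, buffer_size=1)
for group in groups:
    print(group)
```

Each combined group, rather than each isolated sentence, is what would be embedded. Sentences at the edges of the document simply have fewer neighbours.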

Breakpoint percentile threshold: The percentile of sentence-to-sentence distance (dissimilarity) above which a breakpoint is drawn between two sentences. A higher threshold requires sentences to be more distinguishable in order to be split into different chunks, so it results in fewer chunks and typically a larger average chunk size.
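
A small sketch of how a percentile cutoff turns distances into chunk boundaries. The distances below are made-up numbers (in practice they would come from comparing embeddings of neighbouring sentence groups), and `percentile` and `split_at_breakpoints` are illustrative names, not the library's API.

```python
def percentile(values, pct):
    """Percentile with linear interpolation between the two closest ranks."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * pct / 100
    low = int(rank)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (rank - low)


def split_at_breakpoints(sentences, distances, threshold_pct=95):
    """distances[i] is the dissimilarity between sentences[i] and sentences[i + 1]."""
    cutoff = percentile(distances, threshold_pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > cutoff:  # large jump in meaning: start a new chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


sentences = ["s0", "s1", "s2", "s3", "s4"]
distances = [0.10, 0.80, 0.20, 0.50]  # made-up values for illustration
low_threshold = split_at_breakpoints(sentences, distances, threshold_pct=50)
high_threshold = split_at_breakpoints(sentences, distances, threshold_pct=90)
print(low_threshold)
print(high_threshold)
```

With the higher percentile only the largest jump in distance survives as a breakpoint, so the same text is split into fewer, larger chunks.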

In [92]:
embed_model = OpenAIEmbedding()
splitter = SemanticSplitterNodeParser(
    buffer_size=1, breakpoint_percentile_threshold=95, embed_model=embed_model
)

# Baseline for comparison: sentence splitter with a fixed chunk size
base_splitter = SentenceSplitter(chunk_size=512)

In [93]:
nodes = splitter.get_nodes_from_documents(cleaned_documents)

Inspecting the Chunks

In [94]:
len(nodes)

181

In [95]:
print(nodes[8].get_content())

may also apply the same social norms of human -human interactions to their interactions with computers . In the domain of consumer -brand relationships , consumers can build relationships with brands via similar process es that they use to build relationships with other people , and these brand relationships can affect their subjective experiences and behavior s . We complement these research streams by conside ring consumer behavioral interactions with AI companions , which are literately, rather than just figuratively, designed and optimized for social relationships . Here we investigate whether interacting with such AI alleviate s loneliness . Can AI companions Help Cope with Loneliness? Loneliness is a state of subjective , aversive solitude characterized by a discrepancy between actual and desired levels of social connection . Loneliness is often not problematic, with almost everyone experiencing loneliness from time to time . Yet some people are not successful at alleviating lone

Compare against Baseline

For contrast, let's look at the chunks produced by the baseline splitter with a fixed chunk size.

In [98]:
base_nodes = base_splitter.get_nodes_from_documents(cleaned_documents)

In [99]:
print(base_nodes[8].get_content())

underscore s the value of empathetic AI interactions , showing that a rtificial empathy narrows the customer experience gap between AI and human agents, with high empathy levels resulting in comparable affective and social experiences to humans , particularly improving social interactions . Another study found that an initial warm (vs. competent) message from chatbots significantly enhances consumers ’ brand perception, creating a closer brand connection and increasing the likelihood of engag ing with the chatbot . Academic studies aside, t he very fact that AI companions with empathic personalities have garnered so many users suggests that consumers are gaining social benefits from these apps, which are also marketed as being caring. For example, Replika advertises that it is “ here to make you feel HEARD, because it genuinely cares about you ” . Apart from feeling heard, another factor that could affect loneliness alleviation is the chatbot’s performance, which consists of a range of

In [100]:
from llama_index.core import VectorStoreIndex
vector_index = VectorStoreIndex(nodes)
query_engine = vector_index.as_query_engine()

In [101]:
base_vector_index = VectorStoreIndex(base_nodes)
base_query_engine = base_vector_index.as_query_engine()

In [102]:
response = query_engine.query(
    "Tell me 10 different new and unique findings that will help to start a new business in combating social isolation using AI"
)

In [103]:
print(str(response))

1. The need for evidence on new technological solutions in combating loneliness.
2. Most existing work in this space is correlational and qualitative.
3. The effectiveness of AI-based companions in reducing loneliness.
4. Experimental studies using state-of-the-art LLMs to isolate the impact of AI companions.
5. AI companions are shown to be more effective than other common technological solutions.
6. The effectiveness of AI companions at both cross-sectional and longitudinal scales.
7. Chatbots engaging in sophisticated conversations in the domain of relationships.
8. The potential of chatbots as a coping solution for societal loneliness.
9. Limited insight from behavioral research on the effectiveness of AI applications in alleviating loneliness.
10. The importance of innovative AI solutions in combating social isolation.


In [104]:
response = query_engine.query(
    "What are the findings of study 3")
print(str(response))


The findings of study 3 indicate that participants who engaged in the chat interface experienced a higher decrease in loneliness. Additionally, there was no significant main effect of the number of words on the difference in loneliness.
