# RAG Chunking Analysis - Self Chunking

This notebook investigates how documents are chunked and how that affects the quality of the results.

It is an initial review of the chunking mechanism, and it builds manual chunks that align better with the document text: in this case, numbered and bulleted lists are kept with their preceding paragraph.

It also measures the token length of the chunks and uses that to decide how many chunks to retrieve and pass to the LLM as context, so that we give the LLM as much context as we can.

This notebook uses the Mistral 7B Instruct LLM. Others will be tried separately.

We start with a short story created by ChatGPT and stored in three Microsoft Word documents.

In [4]:
# Read the documents

from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("./Data/").load_data()

In [5]:
documents

[Document(id_='b7ba673a-8e24-4aaa-9e6d-fcce22f8efb4', embedding=None, metadata={'file_name': 'Thundertooth Part 1.docx', 'file_path': 'Data/Thundertooth Part 1.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'file_size': 14575, 'creation_date': '2024-02-22', 'last_modified_date': '2024-02-22', 'last_accessed_date': '2024-02-22'}, excluded_embed_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], excluded_llm_metadata_keys=['file_name', 'file_type', 'file_size', 'creation_date', 'last_modified_date', 'last_accessed_date'], relationships={}, text='Thundertooth\n\n\n\nOnce upon a time, in a prehistoric land filled with dense forests and roaring rivers, there lived a dinosaur named Thundertooth. Thundertooth was no ordinary dinosaur; he possessed the rare ability to speak, a talent that set him apart from his ancient companions. One fateful day, as Thundertooth was basking in the 

We will use the GTE Large embedding model from HuggingFace.

In [15]:
# Embeddings

from llama_index.embeddings.huggingface import HuggingFaceEmbedding

embed_model = HuggingFaceEmbedding(model_name="thenlper/gte-large", cache_folder=None)

config.json: 100%|██████████| 619/619 [00:00<00:00, 3.43MB/s]
model.safetensors: 100%|██████████| 670M/670M [00:57<00:00, 11.7MB/s] 
tokenizer_config.json: 100%|██████████| 342/342 [00:00<00:00, 2.74MB/s]
vocab.txt: 100%|██████████| 232k/232k [00:00<00:00, 584kB/s]
tokenizer.json: 100%|██████████| 712k/712k [00:00<00:00, 10.1MB/s]
special_tokens_map.json: 100%|██████████| 125/125 [00:00<00:00, 646kB/s]


Load the LLM, Mistral 7B Instruct (though it could be any other model for our purposes).

In [7]:
import torch

from llama_index.llms.llama_cpp import LlamaCPP
from llama_index.llms.llama_cpp.llama_utils import messages_to_prompt, completion_to_prompt
llm = LlamaCPP(
    model_url=None, # We'll load locally.
    model_path='./Models/mistral-7b-instruct-v0.1.Q6_K.gguf', # 6-bit model
    temperature=0.1,
    max_new_tokens=1024, # Increasing to support longer responses
    context_window=8192, # Mistral7B has an 8K context-window
    generate_kwargs={},
    # set to at least 1 to use GPU
    model_kwargs={"n_gpu_layers": 33}, # 33 was all that was needed for this model and the RTX 3090
    messages_to_prompt=messages_to_prompt,
    completion_to_prompt=completion_to_prompt,
    verbose=True
)

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
llama_model_loader: loaded meta data with 20 key-value pairs and 291 tensors from ./Models/mistral-7b-instruct-v0.1.Q6_K.gguf (version GGUF V2)
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.name str              = mistralai_mistral-7b-instruct-v0.1
llama_model_loader: - kv   2:                       llama.context_length u32              = 32768
llama_model_loader: - kv   3:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv   4:                          llama.block_count u32              = 32
llama_model_loader: - kv   5:               

Now we create and set the global tokenizer, which lets us count the tokens in a piece of text. It needs to match the LLM in use, otherwise the counts will not reflect what the model actually sees.

In [8]:
# Tokenizer must match the model we're using

from llama_index.core import set_global_tokenizer

# huggingface
from transformers import AutoTokenizer
set_global_tokenizer(
  AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1").encode
)

We create a function that splits text into paragraphs but keeps numbered sections and bullet points together with the paragraph that introduces them. This suits these documents because they contain numbered and bulleted points; the rules would need to be adapted for documents with a different structure.

In [9]:
import re

# Define the regular expression pattern for splitting paragraphs
para_split_pattern = re.compile(r'\n\n\n')

def split_text_into_paragraphs(text):
    """Split a document's text into paragraphs, keeping numbered or bulleted points with the paragraph before them."""

    # Use the pattern to split the text into paragraphs
    paragraphs = para_split_pattern.split(text)

    # Combine paragraphs that should not be split
    combined_paragraphs = [paragraphs[0]]

    for p in paragraphs[1:]:
        # Check if the paragraph starts with a number or a dash and, if so, concatenate it to the previous paragraph so we keep them all in one chunk

        # Strip out any leading new lines
        p = p.lstrip('\n')

        if p and (p[0].isdigit() or p[0] == '-'):
            combined_paragraphs[-1] += '\n\n\n' + p
        else:
            combined_paragraphs.append(p)

    # Remove empty strings from the result
    combined_paragraphs = [p.strip() for p in combined_paragraphs if p.strip()]

    return combined_paragraphs
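
As a quick sanity check, the function can be run on a small made-up string. The sample text below is an assumption for illustration only (it is not taken from the Thundertooth documents), and the function body is repeated in condensed form so the snippet runs on its own.

```python
import re

para_split_pattern = re.compile(r'\n\n\n')

def split_text_into_paragraphs(text):
    # Condensed copy of the function above, repeated so this cell is self-contained
    paragraphs = para_split_pattern.split(text)
    combined = [paragraphs[0]]
    for p in paragraphs[1:]:
        p = p.lstrip('\n')
        if p and (p[0].isdigit() or p[0] == '-'):
            # Numbered or dashed item: keep it with the previous paragraph
            combined[-1] += '\n\n\n' + p
        else:
            combined.append(p)
    return [p.strip() for p in combined if p.strip()]

# Hypothetical sample text: a title, an intro with three list items, and a closing paragraph
sample_text = (
    "Title\n\n\n\n"
    "The plan had three steps:\n\n\n"
    "1. Gather parts.\n\n\n"
    "2. Build the widget.\n\n\n"
    "- Test it twice.\n\n\n"
    "Everyone agreed the plan was sound."
)

chunks = split_text_into_paragraphs(sample_text)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---\n{chunk}\n")
```

The three list items end up in the same chunk as "The plan had three steps:", so the sample yields three chunks rather than six. Note that the check only looks at the first character, so an ordinary paragraph that happens to start with a digit (for example a year) would also be merged into the previous chunk.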

Create nodes from the paragraphs that we've carefully split up, counting the tokens in each paragraph so we know what kind of token length we're working with.

In [11]:
from llama_index.core.utilities.token_counting import TokenCounter
from llama_index.core.schema import TextNode
import uuid

token_counter = TokenCounter() # Uses the global tokenizer set above, which should match the LLM

paragraph_separator = "\n\n\n"

# Stores the maximum length of a paragraph, in tokens
max_paragraph_tokens = 0

# Total tokens, used to determine average
total_paragraph_tokens = 0

# Nodes
paragraph_nodes = []

# Loop through the documents, splitting each into paragraphs and checking the number of tokens per paragraph
for document in documents:

    paragraph_token_lens = []
    # paragraphs = document.text.split(paragraph_separator)
    paragraphs = split_text_into_paragraphs(document.text)
    print(f"Document {document.metadata['file_name']} has {len(paragraphs)} paragraphs, token lengths:")
    for paragraph in paragraphs:
        token_count = token_counter.get_string_tokens(paragraph)
        paragraph_token_lens.append(token_count)
        # print(f"Paragraph tokens: {token_count}")

        if token_count > max_paragraph_tokens:
            max_paragraph_tokens = token_count

        total_paragraph_tokens = total_paragraph_tokens + token_count

        # Create and add the node from the paragraph,
        # including metadata we can use for citations
        node = TextNode(text=paragraph, id_=str(uuid.uuid4()))  # the field is id_ and expects a string
        node.metadata["document_name"] = document.metadata["file_name"]
        node.metadata["token_count"] = token_count
        paragraph_nodes.append(node)

    print(paragraph_token_lens)

print(f"\n** The maximum paragraph tokens is {max_paragraph_tokens} **")

average_paragraph_tokens = int(total_paragraph_tokens / len(paragraph_nodes))
print(f"\n** The average paragraph's token count is {average_paragraph_tokens} **")

print(f"\n** Created {len(paragraph_nodes)} nodes **")
# paragraph_nodes


Document Thundertooth Part 1.docx has 12 paragraphs, token lengths:
[5, 102, 78, 98, 70, 88, 72, 29, 73, 56, 67, 118]
Document Thundertooth Part 2.docx has 10 paragraphs, token lengths:
[5, 97, 79, 71, 93, 61, 75, 73, 74, 76]
Document Thundertooth Part 3.docx has 10 paragraphs, token lengths:
[5, 92, 58, 62, 228, 80, 106, 82, 73, 97]

** The maximum paragraph tokens is 228 **

** The average paragraph's token count is 76 **

** Created 32 nodes **


We can see the maximum paragraph token count is 228 and the average is 76. This helps us work out how many paragraphs we can return for the LLM to use for RAG.

If we have 1,000 tokens to feed into the LLM, we can return 10 or so paragraphs, which gives the LLM a good amount of context to work with.
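
The summary figures can be re-derived from the token lengths printed above, which is a cheap way to double-check them. The sketch below copies those lists by hand. It also computes an average that leaves out the 5-token chunks, on the assumption that these are title-only chunks like chunk 0 ("Thundertooth"); that second figure is an extra illustration, not something the notebook uses.

```python
# Token lengths copied from the cell output above, keyed by document
token_lengths = {
    "Thundertooth Part 1.docx": [5, 102, 78, 98, 70, 88, 72, 29, 73, 56, 67, 118],
    "Thundertooth Part 2.docx": [5, 97, 79, 71, 93, 61, 75, 73, 74, 76],
    "Thundertooth Part 3.docx": [5, 92, 58, 62, 228, 80, 106, 82, 73, 97],
}

all_lengths = [n for lengths in token_lengths.values() for n in lengths]

max_tokens = max(all_lengths)
average_tokens = int(sum(all_lengths) / len(all_lengths))

# Assumption: 5-token chunks are title-only and carry little context
body_lengths = [n for n in all_lengths if n > 5]
body_average = int(sum(body_lengths) / len(body_lengths))

print(f"Chunks: {len(all_lengths)}, max: {max_tokens}, average: {average_tokens}")
print(f"Average without the 5-token chunks: {body_average}")
```

The gap between the two averages is small here, but it suggests the short title chunks pull the average down slightly, so a budget based on the average alone is a little optimistic.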

In [16]:
from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
# chunk_size=500, # The token chunk size for each chunk. ** we are building the chunks ourselves, so no need to set this **
# chunk_overlap=5, # The token overlap of each chunk when splitting. ** we are building the chunks ourselves, so no need to set this **
Settings.num_output = 1000 # Let's allow up to 1000 tokens to be output

In [17]:
# Let's look at the default chunking and prompting parameters held in Settings (which replaces the older service_context)

print("service_context.node_parser:")
print("Chunk Overlap:",Settings.node_parser.chunk_overlap)
print("Chunk Size:",Settings.node_parser.chunk_size)
print("Paragraph Separator:",Settings.node_parser.paragraph_separator.replace("\n","\\n"))
print("Secondary Chunking RegEx:",Settings.node_parser.secondary_chunking_regex)
print("Include Metadata:",Settings.node_parser.include_metadata)

print("\nservice_context.prompt_helper:")
print("Context Window:", Settings.prompt_helper.context_window) # The maximum context size that will get sent to the LLM.
print("Chunk Overlap Ratio:", Settings.prompt_helper.chunk_overlap_ratio) # The percentage token amount that each chunk should overlap.
print("Chunk Size Limit:", Settings.prompt_helper.chunk_size_limit) # The maximum size of a chunk.
print("Chunk Separator: '", Settings.prompt_helper.separator, "'") # The separator when chunking tokens.

service_context.node_parser:
Chunk Overlap: 200
Chunk Size: 1024
Paragraph Separator: \n\n\n
Secondary Chunking RegEx: [^,.;。？！]+[,.;。？！]?
Include Metadata: True

service_context.prompt_helper:
Context Window: 8192
Chunk Overlap Ratio: 0.1
Chunk Size Limit: None
Chunk Separator: '   '


#### Create the vector store

Importantly, this index is built from the nodes we created ourselves, rather than from documents that are chunked automatically.

In [18]:
# Now we need to index these nodes, putting them into the vector store

from llama_index.core import VectorStoreIndex

index_from_nodes = VectorStoreIndex(paragraph_nodes, show_progress=True)

Generating embeddings: 100%|██████████| 32/32 [00:00<00:00, 35.96it/s]


In [20]:
# List the "documents" stored in our vector store: these are the nodes we created and stored

for i, document_id in enumerate(index_from_nodes.docstore.docs):
    document = index_from_nodes.docstore.get_document(document_id)
    # metadata is the current name for what was previously exposed as extra_info
    print(f"--- {i} ---\n{document.metadata['document_name']}")
    print(f"id: {document.node_id}")
    print(f"characters: {len(document.text)}")
    print(f"tokens: {document.metadata['token_count']}")
    print(f"[Text Start]\n{document.text}\n[Text End]\n")

--- 0 ---
Thundertooth Part 1.docx
id: da71c044-785c-44e8-8f0f-a8449768d62d
characters: 12
tokens: 5
[Text Start]
Thundertooth
[Text End]

--- 1 ---
Thundertooth Part 1.docx
id: b9ef759e-4943-45eb-a2a2-f3f31fb8ab12
characters: 428
tokens: 102
[Text Start]
Once upon a time, in a prehistoric land filled with dense forests and roaring rivers, there lived a dinosaur named Thundertooth. Thundertooth was no ordinary dinosaur; he possessed the rare ability to speak, a talent that set him apart from his ancient companions. One fateful day, as Thundertooth was basking in the warmth of the sun, a mysterious portal opened before him, and he found himself hurtling through time and space.
[Text End]

--- 2 ---
Thundertooth Part 1.docx
id: 00956141-f2ed-415c-972f-d31f8b3ec76f
characters: 330
tokens: 78
[Text Start]
As the dazzling vortex subsided, Thundertooth opened his eyes to a world unlike anything he had ever seen. The air was filled with the hum of engines, and towering structures reached towa

How many paragraphs should we return? Let's assume we have 1,000 tokens to work with and we know the average paragraph token length is 76.

In [21]:
import math

working_context_token_length = 1000     # Let's allow 1,000 tokens for context

# Round the average (76) up to the next multiple of 50, giving a working figure of 100 tokens per paragraph
return_paragraphs = int(working_context_token_length / (math.ceil(average_paragraph_tokens / 50) * 50))

print(f"We'll return {return_paragraphs} paragraphs for context")

We'll return 10 paragraphs for context
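
Because the estimate is based on an average, the paragraphs actually retrieved may add up to more than the budget if several long ones are returned together. One possible alternative, sketched below and not used in this notebook, is to walk the ranked results and stop once the budget would be exceeded. The function name `count_chunks_within_budget` and the ranking order of the token counts are hypothetical; only the individual counts are taken from the output above.

```python
def count_chunks_within_budget(ranked_token_counts, budget):
    """Return how many of the top-ranked chunks fit inside the token budget."""
    used = 0
    for taken, tokens in enumerate(ranked_token_counts):
        if used + tokens > budget:
            # Adding this chunk would exceed the budget, so stop here
            return taken
        used += tokens
    return len(ranked_token_counts)

# Hypothetical ranking: token counts of the 10 most similar chunks, best match first
ranked_token_counts = [73, 79, 228, 97, 118, 102, 88, 92, 106, 80]

fits = count_chunks_within_budget(ranked_token_counts, budget=1000)
print(f"{fits} of {len(ranked_token_counts)} chunks fit in 1000 tokens")
```

With this made-up ranking, the 228-token paragraph takes up enough room that only nine of the ten chunks fit, whereas ten average-sized chunks would fit comfortably.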


Test time. We run a question and check whether the query returns the number of paragraphs we allowed and whether the LLM answers the question correctly.

In [22]:
from llama_index.core.query_engine import CitationQueryEngine
query_engine = CitationQueryEngine.from_args(
    index_from_nodes,
    similarity_top_k=return_paragraphs,
)

# For citations we get the document info
DB_DOC_ID_KEY = "db_document_id"

test_question = "Did they have any children? If so, what were their names?"

queryQuestion = "<s>[INST] You are a technology specialist. Answer questions in a positive, helpful and empathetic way. Answer the following question: " + test_question + " [/INST]"

response = query_engine.query(queryQuestion)

for index, node in enumerate(response.source_nodes, start=1):
    print(f"{index}/{len(response.source_nodes)}: |{node.node.metadata['document_name']}| {node.node.get_text()}")

test_response = str(response.response).strip()


llama_print_timings:        load time =     139.50 ms
llama_print_timings:      sample time =      10.36 ms /    39 runs   (    0.27 ms per token,  3765.21 tokens per second)
llama_print_timings: prompt eval time =     428.18 ms /  1419 tokens (    0.30 ms per token,  3314.00 tokens per second)
llama_print_timings:        eval time =     433.16 ms /    38 runs   (   11.40 ms per token,    87.73 tokens per second)
llama_print_timings:       total time =     941.98 ms /  1457 tokens


1/10: |Thundertooth Part 2.docx| Source 1:
Thundertooth and Seraphina reveled in the joy of parenthood, watching their children grow and flourish in the futuristic landscape they now called home. The family became an integral part of the city's fabric, not only through the widgets produced in their factory but also through the positive impact each member had on the community.

2/10: |Thundertooth Part 2.docx| Source 2:
Lumina: The eldest of Thundertooth's children, Lumina inherited her mother's intelligence and her father's sense of wonder. With sparkling scales that emitted a soft glow, Lumina had the ability to generate light at will. She became fascinated with technology, often spending hours tinkering with gadgets and inventing new ways to enhance the widgets produced in the family's factory.

3/10: |Thundertooth Part 2.docx| Source 3:
Embraced by the futuristic city and its inhabitants, Thundertooth found a sense of purpose beyond merely satisfying his hunger. Inspired by the adva

And here's the question and result.

In [23]:
print(f"\n\nQuestion:\n{test_question}")
print(f"\n\nResponse:\n{test_response}")



Question:
Did they have any children? If so, what were their names?


Response:
Based on the provided sources, Thundertooth and Seraphina had four children together. Their names were Lumina, Echo, Sapphire, and Ignis.


Looks good!
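
To make this check repeatable rather than relying on reading the answer, the response could be tested for the names we expect. This is a minimal sketch: the expected names and the response text are copied from the output above, the response is pasted in as a string so the snippet runs on its own (in the notebook, `test_response` already holds it), and the variable names `expected_names` and `missing` are illustrative.

```python
# Names we expect, taken from the response shown above
expected_names = ["Lumina", "Echo", "Sapphire", "Ignis"]

# Response text copied from the cell output above
test_response = (
    "Based on the provided sources, Thundertooth and Seraphina had four children "
    "together. Their names were Lumina, Echo, Sapphire, and Ignis."
)

# Case-insensitive substring check for each expected name
missing = [name for name in expected_names if name.lower() not in test_response.lower()]

if missing:
    print(f"Missing from the response: {missing}")
else:
    print("All expected names were found in the response")
```

Substring matching is crude (it would not notice an extra, incorrect name, for instance), but it is enough to flag a regression when comparing chunking strategies or models.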