# Exploring Text Chunking Strategies for RAG

**Objective:** This notebook explores and compares various text chunking strategies, a critical preprocessing step for preparing data for Retrieval-Augmented Generation (RAG) and vector search. We will fetch text from the "Pro Git" book and process it using different methods before ingesting it into a Weaviate vector database.

### Chunking Strategies Covered:
* **Fixed-Size Chunking**: Simple splitting by a fixed word count.
* **Chunking with Overlap**: Fixed-size chunks with overlapping content to preserve context across boundaries.
* **Variable-Size (Semantic) Chunking**: Splitting based on document structure, like paragraphs or section headers.
* **Hybrid Strategy**: A mixed approach combining semantic splitting with a minimum chunk size to balance context and uniformity.

In [1]:
from typing import List
import requests
import re
import weaviate
from weaviate.classes.config import Configure, Property, DataType, Tokenization
from weaviate.util import generate_uuid5
import tqdm
from weaviate.classes.query import Filter

In [2]:
url = "https://raw.githubusercontent.com/progit/progit2/main/book/01-introduction/sections/what-is-git.asc"
source_text = requests.get(url).text

In [3]:
print(source_text[:1000])

[[what_is_git_section]]
=== What is Git?

So, what is Git in a nutshell?
This is an important section to absorb, because if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you.
As you learn Git, try to clear your mind of the things you may know about other VCSs, such as CVS, Subversion or Perforce -- doing so will help you avoid subtle confusion when using the tool.
Even though Git's user interface is fairly similar to these other VCSs, Git stores and thinks about information in a very different way, and understanding these differences will help you avoid becoming confused while using it.(((Subversion)))(((Perforce)))

==== Snapshots, Not Differences

The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data.
Conceptually, most other systems store information as a list of file-based changes.
These other systems (CVS, Subversion, Perforce, and so o

In [4]:
print(f"There are about {len(source_text.split())} words in this chapter. Depending on how our LLM tokenizes words, you'd expect roughly {round(len(source_text.split())*1.3)} tokens.")

There are about 1403 words in this chapter. Depending on how our LLM tokenizes words, you'd expect roughly 1824 tokens.
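
The 1.3 tokens-per-word factor used above is a rule of thumb, not a property of any particular tokenizer. As a minimal sketch, the estimate can be wrapped in a small helper (the name `estimate_tokens` is hypothetical and not part of the notebook):

```python
def estimate_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Rough token estimate from a whitespace word count.

    The 1.3 ratio is the heuristic used in this notebook; real tokenizers vary.
    """
    return round(len(text.split()) * tokens_per_word)


sample = "Git stores and thinks about information in a very different way"
print(estimate_tokens(sample))  # 11 words * 1.3 = 14.3, rounded to 14
```

For an exact count, the text would need to be run through the tokenizer of the model actually in use.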


## Fixed-size chunking

    Splits into chunks of N words each.
    Example: 100 words per chunk.
    Simple, but risks cutting paragraphs mid-sentence.

In [5]:
def get_chunks_fixed_size(text: str, chunk_size: int) -> List[str]:
    """
    Splits a given text into chunks of a specified fixed size.

    Args:
        text (str): The input text to be split into chunks.
        chunk_size (int): The maximum number of words per chunk.

    Returns:
        List[str]: A list of text chunks, each containing up to 'chunk_size' words.
    """
    # Split the input text into individual words
    text_words = text.split()
    
    # Initialize a list to hold the chunks of words
    chunks = []
    
    # Iterate over the word indices in steps of 'chunk_size'
    for i in range(0, len(text_words), chunk_size):
        # Select a sublist of words from 'i' to 'i + chunk_size'
        chunk_words = text_words[i: i + chunk_size]
        
        # Join the selected words into a single string with spaces in between
        chunk = " ".join(chunk_words)
        
        # Add the chunk to the list of chunks
        chunks.append(chunk)
    
    # Return the list of word chunks
    return chunks

In [6]:
fixed_size_chunks = get_chunks_fixed_size(source_text, chunk_size = 100)

In [7]:
print(len(fixed_size_chunks))

15


In [8]:
fixed_size_chunks[0:2]

["[[what_is_git_section]] === What is Git? So, what is Git in a nutshell? This is an important section to absorb, because if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you. As you learn Git, try to clear your mind of the things you may know about other VCSs, such as CVS, Subversion or Perforce -- doing so will help you avoid subtle confusion when using the tool. Even though Git's user interface is fairly similar to these other VCSs, Git stores and thinks about information in",
 'a very different way, and understanding these differences will help you avoid becoming confused while using it.(((Subversion)))(((Perforce))) ==== Snapshots, Not Differences The major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data. Conceptually, most other systems store information as a list of file-based changes. These other systems (CVS, Subversion, Perforce, and s

### Chunking with overlap

    Why? Because if a key phrase (like "information in a very different way") falls on a boundary, overlap ensures it still appears in the next chunk.

In [9]:
def get_chunks_fixed_size_with_overlap(text: str, chunk_size: int, overlap_fraction: float) -> List[str]:
    """
    Splits a given text into chunks of a fixed size with a specified overlap fraction between consecutive chunks.

    Parameters:
    - text (str): The input text to be split into chunks.
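    - chunk_size (int): The number of new words each chunk contributes. Chunks after the first also
      include the overlap words from the previous chunk, so they can be longer than chunk_size.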
    - overlap_fraction (float): The fraction of the chunk size that should overlap with the adjacent chunk.
      For example, an overlap_fraction of 0.2 means 20% of the chunk size will be used as overlap.

    Returns:
    - List[str]: A list of chunks (each a string) where each chunk might overlap with its adjacent chunk.
    """

    # Split the text into individual words
    text_words = text.split()
    
    # Calculate the number of words to overlap between consecutive chunks
    overlap_int = int(chunk_size * overlap_fraction)
    
    # Initialize a list to store the resulting chunks
    chunks = []
    
    # Iterate over text in steps of chunk_size to create chunks
    for i in range(0, len(text_words), chunk_size):
        # Determine the start and end indices for the current chunk,
        # taking into account the overlap with the previous chunk
        chunk_words = text_words[max(i - overlap_int, 0): i + chunk_size]
        
        # Join the selected words to form a chunk string
        chunk = " ".join(chunk_words)
        
        # Append the chunk to the list of chunks
        chunks.append(chunk)
    
    # Return the list of chunks
    return chunks

In [10]:
for chosen_size in [5, 25, 100]:
    chunks = get_chunks_fixed_size_with_overlap(source_text, chosen_size, overlap_fraction=0.2)
    # Print outputs to screen
    print(f"\nSize {chosen_size} - {len(chunks)} chunks returned.")
    for i in range(3):
        print(f"Chunk {i+1}: {chunks[i]}")
        print('-'*100)
    print('='*100)


Size 5 - 281 chunks returned.
Chunk 1: [[what_is_git_section]] === What is Git?
----------------------------------------------------------------------------------------------------
Chunk 2: Git? So, what is Git in
----------------------------------------------------------------------------------------------------
Chunk 3: in a nutshell? This is an
----------------------------------------------------------------------------------------------------

Size 25 - 57 chunks returned.
Chunk 1: [[what_is_git_section]] === What is Git? So, what is Git in a nutshell? This is an important section to absorb, because if you understand what Git
----------------------------------------------------------------------------------------------------
Chunk 2: if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you. As you learn Git, try to
---------------------------------------------------------------------------------------------

Note that the smaller chunks of text are very detailed, but they might **not have enough information to be useful for searching**. In contrast, **larger chunks start to contain more information, similar to a typical paragraph in length**. As these chunks become even longer, **their associated vector embeddings become more general**. Eventually, they reach a point where they are no longer effective for information searching.
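
Two properties of the overlap scheme are worth checking explicitly: the number of chunks depends only on the step size (`ceil(n_words / chunk_size)`, which gives 281 and 57 for the 1403-word text at sizes 5 and 25), and each chunk after the first begins with the tail of the previous one. The sketch below verifies both on a toy word list; `chunk_with_overlap` is a hypothetical helper that mirrors the stepping logic of `get_chunks_fixed_size_with_overlap` but works on a list of words.

```python
import math
from typing import List


def chunk_with_overlap(words: List[str], chunk_size: int, overlap: int) -> List[List[str]]:
    """Step through the words by chunk_size and reach back by `overlap` words."""
    return [
        words[max(i - overlap, 0): i + chunk_size]
        for i in range(0, len(words), chunk_size)
    ]


words = [f"w{n}" for n in range(23)]  # toy "document" of 23 words
chunks = chunk_with_overlap(words, chunk_size=5, overlap=1)

# The number of chunks depends only on the step size, not on the overlap.
assert len(chunks) == math.ceil(len(words) / 5)

# Each chunk after the first starts with the last `overlap` words of the previous one.
for prev, curr in zip(chunks, chunks[1:]):
    assert prev[-1:] == curr[:1]

print(len(chunks), [len(c) for c in chunks])  # 5 [5, 6, 6, 6, 4]
```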

## Variable-size chunking - Splitting on structural markers

    Splits by double newlines → keeps semantic structure (paragraphs).
    Problem: some paragraphs may be too short (like a heading).

In [11]:
# Split the text into paragraphs
def get_chunks_by_paragraph(source_text: str) -> List[str]:
    return source_text.split("\n\n")

Another option for this text is to split by section. In Asciidoc, section headings start with `==` at the beginning of a line, so `\n==` can serve as the section marker.

In [12]:
# Split the text by Asciidoc section markers
def get_chunks_by_asciidoc_sections(source_text: str) -> List[str]:
    return source_text.split("\n==")

In [13]:
for marker in ["\n\n", "\n=="]:
    chunks = source_text.split(marker)
    # Print outputs to screen
    print(f"\nUsing the marker: {repr(marker)} - {len(chunks)} chunks returned.")
    for i in range(3):
        print(f"Chunk {i+1}: {repr(chunks[i])}")
        print('-'*100)
    print('='*100)


Using the marker: '\n\n' - 31 chunks returned.
Chunk 1: '[[what_is_git_section]]\n=== What is Git?'
----------------------------------------------------------------------------------------------------
Chunk 2: "So, what is Git in a nutshell?\nThis is an important section to absorb, because if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you.\nAs you learn Git, try to clear your mind of the things you may know about other VCSs, such as CVS, Subversion or Perforce -- doing so will help you avoid subtle confusion when using the tool.\nEven though Git's user interface is fairly similar to these other VCSs, Git stores and thinks about information in a very different way, and understanding these differences will help you avoid becoming confused while using it.(((Subversion)))(((Perforce)))"
----------------------------------------------------------------------------------------------------
Chunk 3: '==== Snapsho

One noticeable issue with simple marker-based chunking is that **headings often become separate chunks**, which might not be ideal. In practice, we might use a mixed strategy by attaching short chunks, like headings, to the following chunk. This way, the heading stays connected to its relevant section. Let's explore this approach further.
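
A related detail: `str.split` consumes the marker it splits on, so chunks produced with `"\n=="` lose the first two `=` characters of each heading. One way around this, shown here as a sketch rather than the notebook's method, is a regex lookahead that splits before the heading without removing it. The helper name `split_keep_headings` and the sample text are made up for illustration.

```python
import re
from typing import List


def split_keep_headings(text: str) -> List[str]:
    """Split before each line that starts with '==' without consuming the marker."""
    # The lookahead (?===) requires '==' after the newline but leaves it in the next chunk.
    return re.split(r"\n(?===)", text)


sample = "[[intro]]\n=== What is Git?\n\nSome text.\n\n==== Snapshots\n\nMore text."
for chunk in split_keep_headings(sample):
    print(repr(chunk))
```

With this approach each chunk still begins with its full Asciidoc heading, which can help when the heading level matters downstream.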

### Mixing fixed and variable-sized chunking

Use a variable-size chunker to divide the text at structural markers (here, Asciidoc section headings), and then apply a size filter. If a chunk is too small, we can merge it with the next one, and if a chunk is too large, we can split it in the middle or at another marker within the chunk. The implementation below covers the merge step only.

    Hybrid strategy:
    Split by sections.
    If too short (< 25 words), merge with next.
    Keeps headings + content together, avoids tiny useless chunks.

In [14]:
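def mixed_chunking(source_text):
    # Split the text by Asciidoc section marker.
    # Note: str.split() removes the "\n==" marker itself, so headings in the
    # resulting chunks lose their leading newline and first two "=" characters.
    chunks = source_text.split("\n==")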

    # Chunking logic
    new_chunks = []
    chunk_buffer = ""
    min_length = 25

    for chunk in chunks:
        new_buffer = chunk_buffer + chunk  # Create new buffer
        new_buffer_words = new_buffer.split(" ")  # Split on single spaces only (newlines do not separate words here)
        if len(new_buffer_words) < min_length:  # Check whether buffer length is too small
            chunk_buffer = new_buffer  # Carry over to the next chunk
        else:
            new_chunks.append(new_buffer)  # Add to chunks
            chunk_buffer = ""

    if len(chunk_buffer) > 0:
        new_chunks.append(chunk_buffer)  # Add last chunk, if necessary

    return new_chunks

In [15]:
mixed_chunks = mixed_chunking(source_text)
for i in range(3):
    print(f"Chunk {i+1}: {repr(mixed_chunks[i])}")
    print('='*100)

Chunk 1: "[[what_is_git_section]]= What is Git?\n\nSo, what is Git in a nutshell?\nThis is an important section to absorb, because if you understand what Git is and the fundamentals of how it works, then using Git effectively will probably be much easier for you.\nAs you learn Git, try to clear your mind of the things you may know about other VCSs, such as CVS, Subversion or Perforce -- doing so will help you avoid subtle confusion when using the tool.\nEven though Git's user interface is fairly similar to these other VCSs, Git stores and thinks about information in a very different way, and understanding these differences will help you avoid becoming confused while using it.(((Subversion)))(((Perforce)))\n"
Chunk 2: "== Snapshots, Not Differences\n\nThe major difference between Git and any other VCS (Subversion and friends included) is the way Git thinks about its data.\nConceptually, most other systems store information as a list of file-based changes.\nThese other systems (CVS, Subv

This strategy helps ensure that chunks are not too small while still using syntactic markers, like headings, to define boundaries. After examining chunking strategies on one text, let's explore how they perform on a larger collection of texts.
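
One part of the hybrid idea is not covered by `mixed_chunking`: splitting chunks that are too large. A minimal sketch of that step follows, assuming paragraph breaks (`\n\n`) are acceptable split points. The function name `split_oversized` and the `max_words` parameter are hypothetical.

```python
from typing import List


def split_oversized(chunks: List[str], max_words: int) -> List[str]:
    """Split any chunk longer than max_words at paragraph breaks.

    Paragraphs are packed greedily; a single paragraph longer than
    max_words is kept whole rather than cut mid-sentence.
    """
    result = []
    for chunk in chunks:
        if len(chunk.split()) <= max_words:
            result.append(chunk)
            continue
        buffer: List[str] = []
        buffer_words = 0
        for para in chunk.split("\n\n"):
            para_words = len(para.split())
            # Flush the buffer if adding this paragraph would exceed the limit
            if buffer and buffer_words + para_words > max_words:
                result.append("\n\n".join(buffer))
                buffer, buffer_words = [], 0
            buffer.append(para)
            buffer_words += para_words
        if buffer:
            result.append("\n\n".join(buffer))
    return result


long_chunk = "one two three\n\nfour five six\n\nseven eight nine ten"
print(split_oversized(["short chunk", long_chunk], max_words=6))
```

Keeping an oversized paragraph whole is a design choice in this sketch; a stricter size limit would need a fallback such as the fixed-size splitter.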

## Chunking on real data
### Apply all strategies

In [16]:
res = requests.get('https://api.github.com/repos/progit/progit2/contents/book/01-introduction/sections').json()
print(res[0])
print(res[0]['type'])

{'name': 'about-version-control.asc', 'path': 'book/01-introduction/sections/about-version-control.asc', 'sha': '182fcedc00afbdd30fef442864561e6a5a104284', 'size': 4698, 'url': 'https://api.github.com/repos/progit/progit2/contents/book/01-introduction/sections/about-version-control.asc?ref=main', 'html_url': 'https://github.com/progit/progit2/blob/main/book/01-introduction/sections/about-version-control.asc', 'git_url': 'https://api.github.com/repos/progit/progit2/git/blobs/182fcedc00afbdd30fef442864561e6a5a104284', 'download_url': 'https://raw.githubusercontent.com/progit/progit2/main/book/01-introduction/sections/about-version-control.asc', 'type': 'file', '_links': {'self': 'https://api.github.com/repos/progit/progit2/contents/book/01-introduction/sections/about-version-control.asc?ref=main', 'git': 'https://api.github.com/repos/progit/progit2/git/blobs/182fcedc00afbdd30fef442864561e6a5a104284', 'html': 'https://github.com/progit/progit2/blob/main/book/01-introduction/sections/about

In [17]:
res[0]['download_url'].split('/')

['https:',
 '',
 'raw.githubusercontent.com',
 'progit',
 'progit2',
 'main',
 'book',
 '01-introduction',
 'sections',
 'about-version-control.asc']

In [18]:
res[0]['download_url'].split('/')[-1]

'about-version-control.asc'
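
The index lookups explored above (`[-1]` for the filename and, from the same listing, `[-3]` for the chapter directory) can be wrapped in a small helper. The sketch below uses the standard library's URL and path parsing instead of splitting the raw string; `parse_download_url` is a hypothetical name, and the function assumes the `.../book/<chapter>/sections/<filename>` layout seen in this repository.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def parse_download_url(download_url: str) -> dict:
    """Extract the chapter directory and filename from a progit2 raw URL.

    Assumes the layout .../book/<chapter>/sections/<filename>.
    """
    parts = PurePosixPath(urlparse(download_url).path).parts
    return {"chapter_title": parts[-3], "filename": parts[-1]}


url = (
    "https://raw.githubusercontent.com/progit/progit2/main/"
    "book/01-introduction/sections/about-version-control.asc"
)
print(parse_download_url(url))
```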

In [19]:
requests.get(res[0]['download_url']).text

'=== About Version Control\n\n(((version control)))\nWhat is "`version control`", and why should you care?\nVersion control is a system that records changes to a file or set of files over time so that you can recall specific versions later.\nFor the examples in this book, you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer.\n\nIf you are a graphic or web designer and want to keep every version of an image or layout (which you would most certainly want to), a Version Control System (VCS) is a very wise thing to use.\nIt allows you to revert selected files back to a previous state, revert the entire project back to a previous state, compare changes over time, see who last modified something that might be causing a problem, who introduced an issue and when, and more.\nUsing a VCS also generally means that if you screw things up or lose files, you can easily recover.\nIn addition, you get all 

In [20]:
def get_book_text_objects():
    # Source location
    text_objs = list()
    api_base_url = 'https://api.github.com/repos/progit/progit2/contents/book'  # Book base URL
    chapter_urls = ['/01-introduction/sections', '/02-git-basics/sections']  # Chapter paths that contain the section files

    # Loop through book chapters
    for chapter_url in chapter_urls:
        response = requests.get(api_base_url + chapter_url)  # Get the JSON data for the section files in the chapter

        # Loop through inner files (sections)
        for file_info in response.json():
            if file_info['type'] == 'file':  # Only process files (not directories)
                file_response = requests.get(file_info['download_url'])

                # Build objects including metadata
                chapter_title = file_info['download_url'].split('/')[-3]
                filename = file_info['download_url'].split('/')[-1]
                text_obj = {
                    "body": file_response.text,
                    "chapter_title": chapter_title,
                    "filename": filename
                }
                text_objs.append(text_obj)
    return text_objs

In [21]:
# This will generate a list with 14 elements, one for each section file in the two chapters
book_text_objs = get_book_text_objects()

In [22]:
print(book_text_objs[0].keys())

dict_keys(['body', 'chapter_title', 'filename'])


### Chunking the chapters

The following chunking methods will be applied to each section:

- **Fixed-length chunks with 20% overlap:**
  - Chunks with 25 words each
  - Chunks with 100 words each

- **Variable-length chunks** using paragraph markers

- **Mixed-strategy chunks** using Asciidoc section markers with a minimum chunk length of 25 words

Additionally, metadata will be added to each chunk, including the filename, chapter name, and chunk number.

In [23]:
def build_chunk_objs(book_text_obj, chunks):
    """
    Constructs a list of chunk objects from a given book text object 
    and its associated chunks.

    Args:
        book_text_obj (dict): A dictionary containing metadata for the book text, 
                              including 'chapter_title' and 'filename'.
        chunks (list): A list of chunks that represent parts of the book text.

    Returns:
        list: A list of dictionaries, each representing a chunk object 
              with 'chapter_title', 'filename', 'chunk', and 'chunk_index'.
    """
    chunk_objs = list()  # Initialize an empty list to store chunk objects
    
    # Iterate over the chunks with an index
    for i, c in enumerate(chunks):
        # Create a dictionary for each chunk with its associated data
        chunk_obj = {
            "chapter_title": book_text_obj["chapter_title"],  # Chapter title from the book text object
            "filename": book_text_obj["filename"],            # Filename from the book text object
            "chunk": c,                                       # The actual chunk of text
            "chunk_index": i                                  # The index of the chunk in the list
        }
        # Append the constructed chunk object to the list
        chunk_objs.append(chunk_obj)

    # Return the list of chunk objects
    return chunk_objs

    Get multiple sets of chunks - according to chunking strategy

    fixed_size_25 → Breaks text into fixed-size chunks of ~25 words (with 20% overlap).

    fixed_size_100 → Same as above, but ~100 words per chunk.

    para_chunks → Splits text by paragraph boundaries.

    para_chunks_min_25 → A mixed strategy that splits at Asciidoc section markers and merges pieces until each chunk reaches a minimum length (~25 words).

    The result (chunk_obj_sets) looks like:

    {
      "fixed_size_25": [chunk_obj1, chunk_obj2, ...],
      "fixed_size_100": [...],
      "para_chunks": [...],
      "para_chunks_min_25": [...]
    }

In [24]:
# Keys will be the strategy name (e.g., "fixed_size_25")
# Values will be lists of chunk objects produced by that strategy.
chunk_obj_sets = dict()

for book_text_obj in book_text_objs:
    text = book_text_obj["body"]  # Get the object's text body

    # Loop through chunking strategies:
    for strategy_name, chunks in [
        ["fixed_size_25", get_chunks_fixed_size_with_overlap(text, 25, 0.2)],
        ["fixed_size_100", get_chunks_fixed_size_with_overlap(text, 100, 0.2)],
        ["para_chunks", get_chunks_by_paragraph(text)],
        ["para_chunks_min_25", mixed_chunking(text)]
    ]:
        chunk_objs = build_chunk_objs(book_text_obj, chunks)

        if strategy_name not in chunk_obj_sets.keys():
            chunk_obj_sets[strategy_name] = list()

        chunk_obj_sets[strategy_name] += chunk_objs

In [25]:
print(chunk_obj_sets.keys())

dict_keys(['fixed_size_25', 'fixed_size_100', 'para_chunks', 'para_chunks_min_25'])
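
Before loading anything into a database, it can be useful to compare the strategies by chunk count and chunk length. The sketch below computes simple word-count statistics per strategy. `summarize_chunk_sets` is a hypothetical helper, and the toy data only imitates the shape of `chunk_obj_sets`; its values are made up.

```python
from statistics import mean


def summarize_chunk_sets(chunk_sets: dict) -> dict:
    """Return chunk count and word-length stats for each non-empty strategy."""
    summary = {}
    for strategy, chunk_objs in chunk_sets.items():
        lengths = [len(obj["chunk"].split()) for obj in chunk_objs]
        if not lengths:
            continue  # nothing to summarize for an empty strategy
        summary[strategy] = {
            "count": len(lengths),
            "min_words": min(lengths),
            "mean_words": round(mean(lengths), 1),
            "max_words": max(lengths),
        }
    return summary


# Toy data in the same shape as chunk_obj_sets (values are made up for illustration)
toy_sets = {
    "fixed_size_25": [{"chunk": "a b c"}, {"chunk": "d e f g h"}],
    "para_chunks": [{"chunk": "=== Heading"}, {"chunk": "one two three four"}],
}
for strategy, stats in summarize_chunk_sets(toy_sets).items():
    print(strategy, stats)
```

A very small `min_words` for a strategy is a quick signal that headings or other fragments have ended up as standalone chunks.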


In [26]:
chunk_type = 'fixed_size_25'
chunk_obj_sets[chunk_type][0:2]

[{'chapter_title': '01-introduction',
  'filename': 'about-version-control.asc',
  'chunk': '=== About Version Control (((version control))) What is "`version control`", and why should you care? Version control is a system that records changes to a',
  'chunk_index': 0},
 {'chapter_title': '01-introduction',
  'filename': 'about-version-control.asc',
  'chunk': 'that records changes to a file or set of files over time so that you can recall specific versions later. For the examples in this book, you will use software',
  'chunk_index': 1}]

In [27]:
chunk_type = 'para_chunks_min_25'
chunk_obj_sets[chunk_type][0:2]

[{'chapter_title': '01-introduction',
  'filename': 'about-version-control.asc',
  'chunk': '=== About Version Control\n\n(((version control)))\nWhat is "`version control`", and why should you care?\nVersion control is a system that records changes to a file or set of files over time so that you can recall specific versions later.\nFor the examples in this book, you will use software source code as the files being version controlled, though in reality you can do this with nearly any type of file on a computer.\n\nIf you are a graphic or web designer and want to keep every version of an image or layout (which you would most certainly want to), a Version Control System (VCS) is a very wise thing to use.\nIt allows you to revert selected files back to a previous state, revert the entire project back to a previous state, compare changes over time, see who last modified something that might be causing a problem, who introduced an issue and when, and more.\nUsing a VCS also generally means t

### Loading Chunks into a Vector Database

Next, the chunks are loaded into a Weaviate vector database. To save time, the cells below work with a pre-loaded collection.

In [28]:
import subprocess
from contextlib import contextmanager

@contextmanager
def suppress_subprocess_output():
    """
    Context manager that suppresses the standard output and error 
    of any subprocess.Popen calls within this context.
    """
    # Store the original Popen
    original_popen = subprocess.Popen

    def patched_popen(*args, **kwargs):
        # Redirect the stdout and stderr to subprocess.DEVNULL
        kwargs['stdout'] = subprocess.DEVNULL
        kwargs['stderr'] = subprocess.DEVNULL
        return original_popen(*args, **kwargs)

    try:
        # Apply the patch by replacing subprocess.Popen with patched_popen
        subprocess.Popen = patched_popen
        # Yield control back to the context
        yield
    finally:
        # Ensure that the original Popen method is restored
        subprocess.Popen = original_popen

In [29]:
from flask import Flask
import threading
import json
import numpy as np
import logging

app = Flask(__name__)

In [30]:
app.logger.disabled = True
# Get the Flask app's logger
log = logging.getLogger('werkzeug')
# Set logging level (ERROR or CRITICAL suppresses routing logs)
log.setLevel(logging.ERROR)
def run_app():
    app.run(host='0.0.0.0', port=5000, debug = False)

flask_thread = threading.Thread(target=run_app)
flask_thread.start()

 * Serving Flask app '__main__'
 * Debug mode: off


Address already in use
Port 5000 is in use by another program. Either identify and stop that program, or start the server with a different port.


In [31]:
# Loading the client
with suppress_subprocess_output():
    try:
        client = weaviate.connect_to_embedded(
            persistence_data_path="/home/jovyan/data/collections/m3/chunking/",
            environment_variables={
                "ENABLE_API_BASED_MODULES": "true", # Enable API based modules 
                "ENABLE_MODULES": 'text2vec-transformers', # We will be using a transformer model 
                "TRANSFORMERS_INFERENCE_API":"http://127.0.0.1:5000/", # The endpoint the weaviate API will be using to vectorize
            }
        )
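    except Exception as e:
        # Embedded startup failed (for example, an instance is already running),
        # so fall back to connecting to a locally running instance.
        client = weaviate.connect_to_local(port=8079, grpc_port=50050)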

In [32]:
client.collections.exists("chunking_example")

True

In [33]:
# Creating the collection
if not client.collections.exists("chunking_example"):
    collection = client.collections.create(
            name='chunking_example',

            vectorizer_config=[Configure.NamedVectors.text2vec_transformers(
                    name="vector", # This is the name you will need to access the vectors of the objects in your collection
                    #source_properties=['chunk'], # which properties should be used to generate a vector, they will be appended to each other when vectorizing
                    vectorize_collection_name = False, # This tells the client to not vectorize the collection name. 
                                                       # If True, it will be appended at the beginning of the text to be vectorized
                    inference_url="http://127.0.0.1:5000", # Since we are using an API based vectorizer, you need to pass the URL used to make the calls 
                                                           # This was setup in our Flask application
                )],

            properties=[  # Define properties
            Property(name="chunk",data_type= DataType.TEXT),
            Property(name="chapter_title", data_type=DataType.TEXT),
            Property(name="filename",data_type=DataType.TEXT),
            Property(name="chunking_strategy",data_type=DataType.TEXT, tokenization = Tokenization.FIELD), # Tokenization.FIELD indexes the entire field value as a single token, so filters match the strategy name exactly
            Property(name="chunk_index",data_type=DataType.INT),

        ]
        )
else:
    collection = client.collections.get("chunking_example")

In [34]:
# Add objects to the collection. This insertion is skipped when the collection is already populated (it has been pre-vectorized for you).
if len(collection) == 0:
    with collection.batch.fixed_size(batch_size=1, concurrent_requests=1) as batch:
        for chunking_strategy, chunk_objects in tqdm.tqdm(chunk_obj_sets.items()):
            for chunk_obj in chunk_objects:
                chunk_obj["chunking_strategy"] = chunking_strategy
                batch.add_object(
                    properties=chunk_obj,
                    uuid=generate_uuid5(chunk_obj)
                )
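
`generate_uuid5` derives each object's ID from its content, so re-running the import with identical chunk objects targets the same IDs instead of creating duplicates. The sketch below illustrates the idea with the standard library's `uuid.uuid5`. It is an analogue for illustration, not Weaviate's exact implementation, and `stable_id` is a hypothetical helper.

```python
import json
import uuid


def stable_id(obj: dict) -> str:
    """Derive a deterministic UUID from a dict's content (illustrative only)."""
    # sort_keys makes the serialization independent of key insertion order
    payload = json.dumps(obj, sort_keys=True)
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, payload))


a = {"filename": "what-is-git.asc", "chunk_index": 0, "chunk": "=== What is Git?"}
b = dict(a)                  # same content -> same ID
c = {**a, "chunk_index": 1}  # different content -> different ID

print(stable_id(a) == stable_id(b))  # True
print(stable_id(a) == stable_id(c))  # False
```

Because `chunking_strategy` is added to each chunk object before the ID is generated, the same text chunked under two strategies still gets two distinct IDs.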

In [37]:
print(f"Total count: {collection.aggregate.over_all().total_count}")
for chunking_strategy in chunk_obj_sets.keys():
    where_filter = Filter.by_property('chunking_strategy').equal(chunking_strategy) # Filter by chunking strategy
    count = collection.aggregate.over_all(filters = where_filter).total_count # Aggregate with filtering
    print(f"Object count for {chunking_strategy}: {count}")

Total count: 1487
Object count for fixed_size_25: 672
Object count for fixed_size_100: 173
Object count for para_chunks: 549
Object count for para_chunks_min_25: 93


## Searching 

This section runs the same semantic search against each chunking strategy to show how chunk size affects information retrieval.

In [38]:
search_string = "history of git" 

for chunking_strategy in chunk_obj_sets.keys():
    where_filter = Filter.by_property('chunking_strategy').equal(chunking_strategy)
    response = collection.query.near_text(search_string, filters = where_filter, limit = 2)
    print(f"RETRIEVED OBJECTS FOR CHUNKING STRATEGY {chunking_strategy.upper()}:\n")
    for i, obj in enumerate(response.objects):
        print(f"===== Object {i} =====")
        print(f"{obj.properties['chunk']}")
        print('-'*100)
    print('='*100)

RETRIEVED OBJECTS FOR CHUNKING STRATEGY FIXED_SIZE_25:

===== Object 0 =====
=== A Short History of Git As with many great things in life, Git began with a bit of creative destruction and fiery controversy. The
----------------------------------------------------------------------------------------------------
===== Object 1 =====
kernel efficiently (speed and data size) Since its birth in 2005, Git has evolved and matured to be easy to use and yet retain these initial qualities. It's amazingly fast,
----------------------------------------------------------------------------------------------------
RETRIEVED OBJECTS FOR CHUNKING STRATEGY FIXED_SIZE_100:

===== Object 0 =====
=== A Short History of Git As with many great things in life, Git began with a bit of creative destruction and fiery controversy. The Linux kernel is an open source software project of fairly large scope.(((Linux))) During the early years of the Linux kernel maintenance (1991–2002), changes to the software were pa

In this example, the query is a broad one: "history of git". For this kind of query, longer chunks tend to be more useful. The 25-word chunks may match the query closely in semantic similarity, but they lack the context needed to improve the reader's understanding of the topic. The paragraph chunks, particularly those with a minimum length of 25 words, carry enough information to explain the history of Git.

In [39]:
search_string = "how to add the url of a remote repository"

for chunking_strategy in chunk_obj_sets.keys():
    where_filter = Filter.by_property('chunking_strategy').equal(chunking_strategy)
    response = collection.query.near_text(search_string, filters = where_filter, limit = 2)
    print(f"RETRIEVED OBJECTS FOR CHUNKING STRATEGY {chunking_strategy.upper()}:\n")
    for i, obj in enumerate(response.objects):
        print(f"===== Object {i} =====")
        print(f"{obj.properties['chunk']}")
        print('-'*100)
    print('='*100)

RETRIEVED OBJECTS FOR CHUNKING STRATEGY FIXED_SIZE_25:

===== Object 0 =====
remote))) To add a new remote Git repository as a shortname you can reference easily, run `git remote add <shortname> <url>`: [source,console] ---- $ git remote origin $ git remote
----------------------------------------------------------------------------------------------------
===== Object 1 =====
manage your remote repositories. Remote repositories are versions of your project that are hosted on the Internet or network somewhere. You can have several of them, each of which generally
----------------------------------------------------------------------------------------------------
RETRIEVED OBJECTS FOR CHUNKING STRATEGY FIXED_SIZE_100:

===== Object 0 =====
adds the `origin` remote for you. Here's how to add a new remote explicitly.(((git commands, remote))) To add a new remote Git repository as a shortname you can reference easily, run `git remote add <shortname> <url>`: [source,console] ---- $ git remo

In this example, the query is more specific: the kind a user would make when looking for how to add the URL of a remote repository. Unlike the previous scenario, the 25-word chunks prove more useful here. Because the question is narrow, Weaviate can pinpoint the chunk with the most relevant passage, which explains how to add a remote repository (`git remote add <shortname> <url>`).

Although other result sets contain some of this information, it's important to consider how the result will be used and displayed. Longer results might require more cognitive effort from the user to extract the relevant information.

## Incorporating in a RAG system


With a fully working collection in place, let's see how different chunk sizes affect text generation, using a simple prompt.

In [40]:
PROMPT = "Using this information and only this information, please explain {search_string} in a few short points.\nContext: {context}"

In [41]:
print(PROMPT)

Using this information and only this information, please explain {search_string} in a few short points.
Context: {context}
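
The template has two placeholders: the user's query and the retrieved context. As a minimal sketch of how they are filled, the helper below joins chunk texts with newlines and formats the template. `build_prompt` is a hypothetical name, and the two chunk strings are short stand-ins rather than real retrieval results.

```python
from typing import List

PROMPT = (
    "Using this information and only this information, "
    "please explain {search_string} in a few short points.\nContext: {context}"
)


def build_prompt(search_string: str, chunks: List[str]) -> str:
    """Join retrieved chunk texts, one per line, and fill the template."""
    context = "\n".join(chunks)
    return PROMPT.format(search_string=search_string, context=context)


print(build_prompt("history of git", ["First retrieved chunk.", "Second retrieved chunk."]))
```

Only the template's own braces are interpreted by `str.format`, so chunk text that happens to contain `{` or `}` is inserted as-is.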


In [42]:
from utils import generate_with_single_input

In [44]:
# Set number of chunks to retrieve to compensate for different chunk sizes

n_chunks_by_strat = dict()

# Grab more of shorter chunks
n_chunks_by_strat['fixed_size_25'] = 8
n_chunks_by_strat['para_chunks'] = 8

# Grab fewer of longer chunks
n_chunks_by_strat['fixed_size_100'] = 2
n_chunks_by_strat['para_chunks_min_25'] = 2

# Perform retrieval-augmented generation
search_string = "history of git"  # Or "available git remote commands"

for chunking_strategy in chunk_obj_sets.keys():
    where_filter = Filter.by_property('chunking_strategy').equal(chunking_strategy)
    response = collection.query.near_text(search_string, filters = where_filter, limit = n_chunks_by_strat[chunking_strategy])
    context_string = ""
    for obj in response.objects:
        context_string += obj.properties['chunk'] + '\n'
    prompt = PROMPT.format(search_string = search_string, context = context_string)
    response = generate_with_single_input(prompt, role = 'assistant')
    print(f"Search string: {search_string}")
    print(f"Chunking Strategy: {chunking_strategy}:")
    print(f"Response:\n\t{response['content']}")
    print('='*100)

Search string: history of git
Chunking Strategy: fixed_size_25:
Response:
	Here are a few short points summarizing the history of Git:

1. **Creation and Early Development (2005-2008)**: Git was created by Scott Chacon and Junio Hamano in 2005. The first version was released in 2006, and it underwent significant development and refinement until 2008.

2. **Initial Commit and First Public Release (2008)**: The first public release of Git was in October 2008, with Junio Hamano contributing to the project. This release included the initial commit history.

3. **Maturity and Evolution (2009-Present)**: After its initial release, Git continued to evolve and mature. It became widely adopted in the open-source community and was eventually acquired by GitHub in 2018.

4. **Key Features and Advancements**: Throughout its development, Git has retained its core efficiency and speed while adding new features and improving its user interface. Some notable advancements include the introduction of th
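
The per-strategy limits used above (8 for the short chunks, 2 for the long ones) amount to giving each strategy a roughly equal word budget. A hedged sketch of that arithmetic follows; the 200-word budget is an assumption chosen so the results match those limits, and `chunks_for_budget` is a hypothetical helper.

```python
def chunks_for_budget(word_budget: int, approx_chunk_words: int) -> int:
    """How many chunks of a given approximate size fit in a word budget (at least 1)."""
    return max(1, word_budget // approx_chunk_words)


WORD_BUDGET = 200  # assumed budget, chosen so the results match the limits used above
print(chunks_for_budget(WORD_BUDGET, 25))   # 8
print(chunks_for_budget(WORD_BUDGET, 100))  # 2
```

Paragraph-based chunks vary in length, so for those strategies the chunk size passed in would be an average rather than a fixed value.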

In [45]:
client.close()