### semantic chunking experiments

Required packages installation

In [1]:
!python --version

Python 3.12.7


In [2]:
!pip install sentence-transformers langchain seaborn scikit-learn rich


[notice] A new release of pip is available: 24.2 -> 24.3.1
[notice] To update, run: pip install --upgrade pip


In [3]:
!pip list

Package                  Version
------------------------ -----------
aiohappyeyeballs         2.4.4
aiohttp                  3.11.11
aiosignal                1.3.2
annotated-types          0.7.0
anyio                    4.7.0
asttokens                3.0.0
attrs                    24.3.0
certifi                  2024.12.14
charset-normalizer       3.4.0
comm                     0.2.2
contourpy                1.3.1
cycler                   0.12.1
debugpy                  1.8.11
decorator                5.1.1
executing                2.1.0
filelock                 3.16.1
fonttools                4.55.3
frozenlist               1.5.0
fsspec                   2024.12.0
greenlet                 3.1.1
h11                      0.14.0
httpcore                 1.0.7
httpx                    0.28.1
huggingface-hub          0.27.0
idna                     3.10
ipykernel                6.29.5
ipython                  8.31.0
jedi                     0.19.2
Jinja2                   3.1.5
joblib    

Global imports

In [4]:
from rich import print

Loading the [document](https://paulgraham.com/mit.html) (A Student's Guide to Startups, Paul Graham) used for the chunking evaluation

In [5]:
with open('./paul_graham_essay.txt', 'r') as file:
    essay = file.read()

essay[:256]

"A Student's Guide to Startups\n\nWant to start a startup? Get funded by Y Combinator.\n\nOctober 2006\n\n(This essay is derived from a talk at MIT.)\n\nTill recently graduating seniors had two choices: get a job or go to grad school. I think there will increasingl"

Splitting the loaded document into sentences, i.e. at whitespace preceded by one of .!?

In [6]:
import re

split_essay = re.split(r'(?<=[.!?])\s+', essay)
print(len(split_essay))
print(split_essay[:3])
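
To make the behaviour of this pattern concrete, here is a small self-contained sketch on a made-up string (the sample text and the name `SENTENCE_BOUNDARY` are mine, not part of the experiment). The lookbehind keeps the punctuation attached to each sentence, and a paragraph break counts as ordinary whitespace. The last line shows a known weakness of such a simple rule: abbreviations are split as well.

```python
import re

# Split at whitespace that is preceded by '.', '!' or '?'.
SENTENCE_BOUNDARY = re.compile(r'(?<=[.!?])\s+')

sample = "Want to start a startup? Get funded. It works!\n\nNext paragraph."
sentences = SENTENCE_BOUNDARY.split(sample)
print(sentences)

# Limitation: an abbreviation ending in '.' is treated as a sentence end.
abbreviation = SENTENCE_BOUNDARY.split("Dr. Smith arrived.")
print(abbreviation)
```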

### researching how embeddings behave on the paragraphs

Extracting 4 paragraphs from the split sentences:

In [7]:
p1 = split_essay[5:8]
p1

["I'm sure the default will always be to get a job, but starting a startup could well become as popular as grad school.",
 "In the late 90s my professor friends used to complain that they couldn't get grad students, because all the undergrads were going to work for startups.",
 "I wouldn't be surprised if that situation returns, but with one difference: this time they'll be starting their own instead of going to work for other people's."]

In [8]:
p2 = split_essay[8:12]
p2

['The most ambitious students will at this point be asking: Why wait till you graduate?',
 "Why not start a startup while you're in college?",
 'In fact, why go to college at all?',
 'Why not start a startup instead?']

In [9]:
p3 = split_essay[12:18]
p3

['A year and a half ago I gave a talk where I said that the average age of the founders of Yahoo, Google, and Microsoft was 24, and that if grad students could start startups, why not undergrads?',
 "I'm glad I phrased that as a question, because now I can pretend it wasn't merely a rhetorical one.",
 "At the time I couldn't imagine why there should be any lower limit for the age of startup founders.",
 'Graduation is a bureaucratic change, not a biological one.',
 'And certainly there are undergrads as competent technically as most grad students.',
 "So why shouldn't undergrads be able to start startups as well as grad students?"]

In [10]:
p4 = split_essay[18:22]
p4

['I now realize that something does change at graduation: you lose a huge excuse for failing.',
 "Regardless of how complex your life is, you'll find that everyone else, including your family and friends, will discard all the low bits and regard you as having a single occupation at any given time.",
 "If you're in college and have a summer job writing software, you still read as a student.",
 "Whereas if you graduate and get a job programming, you'll be instantly regarded by everyone as a programmer."]

Defining the sentence transformer model to use: all-mpnet-base-v2. Based on the [stats](https://sbert.net/docs/sentence_transformer/pretrained_models.html), it performs the best among the models that can be hosted locally.

In [11]:
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('sentence-transformers/all-mpnet-base-v2')



Calculating the cosine similarity between the embeddings

In [12]:
embeddings = model.encode(p1)
print(p1)
print(model.similarity_fn_name)
print(model.similarity(embeddings, embeddings))
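
For reference, a minimal pure-Python sketch of what a pairwise cosine similarity matrix computes, assuming the model's similarity function is cosine (which is what printing `similarity_fn_name` is meant to confirm). The helper names and the toy 2-D vectors are hypothetical; real embeddings have many more dimensions.

```python
import math

def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Dot product of u and v divided by the product of their lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def similarity_matrix(vectors: list[list[float]]) -> list[list[float]]:
    """Pairwise cosine similarities; entry [i][j] compares vector i with vector j."""
    return [[cosine_similarity(u, v) for v in vectors] for u in vectors]

# Toy vectors (assumption): same direction -> 1, orthogonal -> 0.
toy_vectors = [[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]]
matrix = similarity_matrix(toy_vectors)
for row in matrix:
    print([round(value, 4) for value in row])
```

The matrix is symmetric with ones on the diagonal, which matches the shape of the tensors printed in this notebook.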

### testing the sliding window approach

Here I form progressively larger chunks by adding sentences one at a time, so the largest chunk contains every sentence in the paragraph. In this particular case the chunks come from a single paragraph, and each of them is then embedded.

In [13]:
prev = ''
chunks = []
for sentence in p1:
    res = prev + ' ' + sentence
    chunks.append(res)
    prev = res

chunks

[" I'm sure the default will always be to get a job, but starting a startup could well become as popular as grad school.",
 " I'm sure the default will always be to get a job, but starting a startup could well become as popular as grad school. In the late 90s my professor friends used to complain that they couldn't get grad students, because all the undergrads were going to work for startups.",
 " I'm sure the default will always be to get a job, but starting a startup could well become as popular as grad school. In the late 90s my professor friends used to complain that they couldn't get grad students, because all the undergrads were going to work for startups. I wouldn't be surprised if that situation returns, but with one difference: this time they'll be starting their own instead of going to work for other people's."]

Let's calculate the similarities between the embedded chunks:

In [14]:
embeddings = model.encode(chunks)
model.similarity(embeddings, embeddings)

tensor([[1.0000, 0.9266, 0.9116],
        [0.9266, 1.0000, 0.9905],
        [0.9116, 0.9905, 1.0000]])

Once the chunks are embedded, we see that the similarity to the first chunk decreases slowly and steadily as sentences are added (1.0000, 0.9266, 0.9116).

Let's mix two paragraphs together. A paragraph break signals that, by the author's intention, the pieces of text on either side carry a different semantic meaning.
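
As a side note, the growing chunks can also be built with `itertools.accumulate`. This is only a sketch with a hypothetical helper name and toy sentences; unlike the loop in this notebook, it does not put a leading space in front of the first chunk.

```python
from itertools import accumulate

def growing_chunks(sentences: list[str]) -> list[str]:
    """Return cumulative chunks: the first sentence, the first two, and so on."""
    return list(accumulate(sentences, lambda chunk, sentence: f'{chunk} {sentence}'))

toy = ['One.', 'Two.', 'Three.']
cumulative = growing_chunks(toy)
print(cumulative)
```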

In [15]:
prev = ''
mixed_chunks = []
for sentence in [*p1, *p2]:
    res = prev + ' ' + sentence
    mixed_chunks.append(res)
    prev = res

mixed_chunks

[" I'm sure the default will always be to get a job, but starting a startup could well become as popular as grad school.",
 " I'm sure the default will always be to get a job, but starting a startup could well become as popular as grad school. In the late 90s my professor friends used to complain that they couldn't get grad students, because all the undergrads were going to work for startups.",
 " I'm sure the default will always be to get a job, but starting a startup could well become as popular as grad school. In the late 90s my professor friends used to complain that they couldn't get grad students, because all the undergrads were going to work for startups. I wouldn't be surprised if that situation returns, but with one difference: this time they'll be starting their own instead of going to work for other people's.",
 " I'm sure the default will always be to get a job, but starting a startup could well become as popular as grad school. In the late 90s my professor friends used to 

Let's calculate the similarities between the embedded chunks:

In [16]:
embeddings = model.encode(mixed_chunks)
model.similarity(embeddings, embeddings)

tensor([[1.0000, 0.9266, 0.9116, 0.8851, 0.8701, 0.8668, 0.8665],
        [0.9266, 1.0000, 0.9905, 0.9698, 0.9663, 0.9607, 0.9580],
        [0.9116, 0.9905, 1.0000, 0.9805, 0.9725, 0.9661, 0.9638],
        [0.8851, 0.9698, 0.9805, 1.0000, 0.9922, 0.9878, 0.9856],
        [0.8701, 0.9663, 0.9725, 0.9922, 1.0000, 0.9984, 0.9968],
        [0.8668, 0.9607, 0.9661, 0.9878, 0.9984, 1.0000, 0.9991],
        [0.8665, 0.9580, 0.9638, 0.9856, 0.9968, 0.9991, 1.0000]])

Once again we see the same pattern: the similarity to the first chunk drifts down slowly. The main question arises: how do we tell where the edge is, and which threshold value should we use? Those questions are yet to be answered.
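
To get a feel for how sensitive the edge is to the threshold, here is a small sketch that takes the first row of the matrix printed above (similarity of every growing chunk to the first sentence) and reports the first chunk index that falls below a few candidate thresholds. The helper `first_below` and the candidate values are my own choices for illustration.

```python
from typing import Optional

# First row of the similarity matrix printed above.
first_row = [1.0000, 0.9266, 0.9116, 0.8851, 0.8701, 0.8668, 0.8665]

def first_below(similarities: list[float], thresh: float) -> Optional[int]:
    """Index of the first similarity below thresh, or None if there is none."""
    for index, value in enumerate(similarities):
        if value < thresh:
            return index
    return None

# Chunks 0-2 hold only sentences from p1; chunk 3 is the first to include p2.
cuts = {thresh: first_below(first_row, thresh) for thresh in (0.90, 0.88, 0.86)}
print(cuts)
```

With these numbers a threshold of 0.90 cuts exactly where the second paragraph starts, 0.88 cuts one sentence later, and 0.86 never cuts, so a small change in the threshold moves the edge noticeably.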

### algorithm idea

Let's formulate the algorithm idea we have so far:
1. Set the threshold value, for instance 0.9.
2. While the similarity stays at or above the threshold:
   
   2.1 Iteratively take the next sentence and add it to the previous iteration's chunk.
   
   2.2 Calculate the embedding of the newly created chunk. 

   2.3 If the similarity between the new chunk's embedding and the initial sentence's embedding is greater than or equal to the threshold => continue.

   2.4 Otherwise, stop the iteration.

The last chunk before the stop is a formed semantic unit. We can then rerun the algorithm starting from the sentence that broke the loop.

### algorithm implementation

In [17]:
a = model.encode('dog')
b = model.encode('cat')

res = model.similarity(a, b)
res.numpy()[0][0]

np.float32(0.60812265)

In [18]:
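def semantic_chunking(text: list[str], thresh: float = 0.9, verbose: bool = True) -> list[str]:
    prev = ''        # chunk accumulated so far
    init = text[0]   # first sentence of the current chunk, used as the reference
    chunks = []

    for sentence in text:
        res = prev + ' ' + sentence
        # cosine similarity between the reference sentence and the candidate chunk
        # (the variable is called dist, but a higher value means more similar)
        dist = model.similarity(
            model.encode(init),
            model.encode(res)
        ).numpy()[0][0]

        if dist < thresh:
            # logging
            if verbose:
                print(f'prev: {prev}\nbreakpoint sentence: {sentence}\ndist: {dist}\nchunks count: {len(chunks) + 1}')
                print('=' * 25)

            # close the current chunk and start a new one from the breakpoint sentence
            chunks.append(prev)
            prev = sentence
            init = sentence
        else:
            prev = res

    # edge case handling: the last chunk is still open when the loop ends
    if prev not in chunks:
        chunks.append(prev)

    return chunks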
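
To check the control flow without loading the model, here is a sketch of the same loop with the similarity passed in as a function. Everything in it is an assumption made for illustration: the name `chunk_by_similarity`, the word-overlap (Jaccard) stand-in for embedding similarity, the toy sentences, and the threshold. It also differs slightly from the function above: it starts from the first sentence directly, so no chunk gets a leading space.

```python
from typing import Callable

def jaccard(a: str, b: str) -> float:
    """Toy stand-in for embedding similarity: word-set overlap in [0, 1]."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0

def chunk_by_similarity(
    sentences: list[str],
    thresh: float,
    similarity: Callable[[str, str], float] = jaccard,
) -> list[str]:
    """Grow a chunk until its similarity to its first sentence drops below thresh."""
    if not sentences:
        return []
    chunks = []
    anchor = current = sentences[0]
    for sentence in sentences[1:]:
        candidate = current + ' ' + sentence
        if similarity(anchor, candidate) < thresh:
            # Close the chunk and restart from the sentence that broke it.
            chunks.append(current)
            anchor = current = sentence
        else:
            current = candidate
    chunks.append(current)  # the last chunk is always still open here
    return chunks

toy = ["cats purr", "cats also purr loudly", "stocks fell", "stocks also fell sharply"]
toy_chunks = chunk_by_similarity(toy, thresh=0.45)
print(toy_chunks)
```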


Testing the algorithm implementation

In [19]:
text = split_essay[5:22]
text

["I'm sure the default will always be to get a job, but starting a startup could well become as popular as grad school.",
 "In the late 90s my professor friends used to complain that they couldn't get grad students, because all the undergrads were going to work for startups.",
 "I wouldn't be surprised if that situation returns, but with one difference: this time they'll be starting their own instead of going to work for other people's.",
 'The most ambitious students will at this point be asking: Why wait till you graduate?',
 "Why not start a startup while you're in college?",
 'In fact, why go to college at all?',
 'Why not start a startup instead?',
 'A year and a half ago I gave a talk where I said that the average age of the founders of Yahoo, Google, and Microsoft was 24, and that if grad students could start startups, why not undergrads?',
 "I'm glad I phrased that as a question, because now I can pretend it wasn't merely a rhetorical one.",
 "At the time I couldn't imagine why

In [20]:
chunks = semantic_chunking(text, thresh=0.88)

In [21]:
print(len(chunks))
print(chunks)

### tuning the thresh value

Since paragraphs represent the semantic units the author intended, we could in theory use this information to tune the algorithm: adjust the threshold until it produces a number of chunks close to the number of paragraphs.

First, let's find out the number of paragraphs in the document:

In [22]:
p_split_essay = essay.split('\n\n')
print(len(p_split_essay))
print(p_split_essay[:5])

Let's filter the empty lines out of the split essay:

In [23]:
filter_essay = list(filter(lambda p: len(p) > 0, p_split_essay))
len(filter_essay)

87
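
A slightly more tolerant way to count paragraphs, sketched on a made-up string: split on any run of blank lines and drop pieces that contain only whitespace. The helper name and the sample are hypothetical, and on the real essay this may give a different count than the `split('\n\n')` plus length filter used above.

```python
import re

def split_paragraphs(text: str) -> list[str]:
    """Split on blank lines and drop empty or whitespace-only pieces."""
    return [part.strip() for part in re.split(r'\n\s*\n', text) if part.strip()]

sample = "Title\n\nFirst paragraph.\n\n\n\nSecond paragraph.\n  \n"
paragraph_list = split_paragraphs(sample)
print(len(paragraph_list), paragraph_list)
```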

#### thresh tuning algorithm idea

Now, let's formulate the threshold tuning algorithm idea:

1. Start from a thresh value of 0.99, since a cosine similarity of 1 means that two embeddings point in the same direction (realistically, the starting value should probably be lower, approx. 0.95-0.97).  
2. Initialize the previous iteration distance variable and the best threshold variable.  
3. While true:  

   3.1 Perform the semantic chunking with the current threshold over the text (a list of split sentences);  

   3.2 Compare the number of calculated chunks to the actual number of paragraphs;  

   3.3 If the number of calculated chunks is closer to the actual number of paragraphs than it was in the previous iteration:  
   
   3.3.1 Update the best threshold variable;  

   3.3.2 Update the thresh variable, lowering it by some step value, e.g., 0.02-0.05;  

   3.3.3 Continue.  
   
   3.4 Otherwise, stop the loop.  

As a result, we get the best thresh value among those tried, i.e. the one whose number of chunks is closest to the actual number of paragraphs.  

#### thresh tuning algorithm implementation

In [24]:
def dist(a: int, b: int) -> int:
    """Absolute difference between two counts."""
    return abs(a - b)

def thresh_tune(
    pars: int,  # pars - number of paragraphs to match
    text: list[str],
    thresh: float = 0.97,
    step: float = 0.03,
    verbose: bool = True
) -> float:
    best_thresh = 0.0
    prev_dist = float('inf')

    # lower the threshold while the chunk count keeps getting closer to pars
    while True:
        chunks = semantic_chunking(text, thresh, verbose=False)
        dist_ = dist(pars, len(chunks))

        if dist_ < prev_dist:
            # logging
            if verbose:
                print(f'dist: {dist_}\nchunks amount: {len(chunks)}\nthresh: {thresh}')
                print('=' * 25)

            prev_dist = dist_
            best_thresh = thresh
            thresh -= step
        else:
            # stop at the first step that does not improve
            break

    return best_thresh

Let's test the implementation:

In [25]:
paragraphs = len(filter_essay)
best_thresh = thresh_tune(paragraphs, split_essay[5:], thresh=0.94, step=0.01)
best_thresh

0.8399999999999999
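
Two details of this run are worth noting: the loop stops at the first step that does not improve, and repeated subtraction of the step accumulates floating-point error (hence 0.8399999999999999 rather than 0.84). Below is a minimal sketch of an alternative, assuming an exhaustive scan is affordable: build the grid from integer multiples of the step and keep the threshold whose chunk count is closest to the target. The names `threshold_grid`, `grid_tune`, and `count_chunks` are hypothetical, and the toy counter only stands in for `len(semantic_chunking(text, t, verbose=False))`.

```python
from typing import Callable

def threshold_grid(start: float, stop: float, step: float) -> list[float]:
    """Thresholds from start down to stop, computed as start - i * step."""
    n_steps = int(round((start - stop) / step))
    # Rounding each value avoids drift such as 0.8399999999999999.
    return [round(start - i * step, 10) for i in range(n_steps + 1)]

def grid_tune(
    target: int,
    count_chunks: Callable[[float], int],
    start: float = 0.94,
    stop: float = 0.80,
    step: float = 0.01,
) -> float:
    """Return the grid threshold whose chunk count is closest to target."""
    grid = threshold_grid(start, stop, step)
    return min(grid, key=lambda t: abs(target - count_chunks(t)))

# Toy counter (assumption): pretend the chunk count grows with the threshold.
def toy_count(thresh: float) -> int:
    return int(round(thresh * 100))

grid = threshold_grid(0.94, 0.80, 0.01)
best = grid_tune(87, toy_count)
print(len(grid), best)
```

The scan costs one full chunking pass per grid value, so on the real essay it is slower than the early-stopping loop; the trade-off is that it cannot stop at a local minimum.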

In [26]:
semantic_chunking(
    split_essay[5:],
    0.86
)[:5]

[" I'm sure the default will always be to get a job, but starting a startup could well become as popular as grad school. In the late 90s my professor friends used to complain that they couldn't get grad students, because all the undergrads were going to work for startups. I wouldn't be surprised if that situation returns, but with one difference: this time they'll be starting their own instead of going to work for other people's. The most ambitious students will at this point be asking: Why wait till you graduate? Why not start a startup while you're in college? In fact, why go to college at all? Why not start a startup instead?",
 "A year and a half ago I gave a talk where I said that the average age of the founders of Yahoo, Google, and Microsoft was 24, and that if grad students could start startups, why not undergrads? I'm glad I phrased that as a question, because now I can pretend it wasn't merely a rhetorical one. At the time I couldn't imagine why there should be any lower limi

#### tuning co