# Two-Stage Splitting

In this post I'll share a simple two-stage technique for getting cleaner splits from your documents: split on sentence boundaries first, then fall back to a recursive character splitter for any chunk that is still too long. We will use Docling to extract the content of the essay [Meditations on Moloch](https://slatestarcodex.com/2014/07/30/meditations-on-moloch/) by Scott Alexander.

<!-- more -->

In [9]:
from docling.document_converter import DocumentConverter
import numpy as np
import plotly.express as px
from langchain_huggingface import HuggingFaceEmbeddings

In [2]:
converter = DocumentConverter()

meditations_on_moloch_url = (
    "https://slatestarcodex.com/2014/07/30/meditations-on-moloch/"
)

result = converter.convert(meditations_on_moloch_url)

md_result = result.document.export_to_markdown()

len(md_result)

717190

In [3]:
print(md_result[:256])

### Economics

- Artir Kel
- Bryan Caplan
- David Friedman
- Pseudoerasmus
- Scott Sumner
- Tyler Cowen

Effective Altruism

### Effective Altruism

- 80000 Hours Blog
- Effective Altruism Forum
- GiveWell Blog

Rationality

### Rationality

- Alyssa Vance


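Alright, we have 717,190 characters in the document! As the preview shows, the export covers the whole page, sidebar links included, not just the essay. Let's start with a naive approach and see what our chunks look like.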

In [43]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

recursive_splitter = RecursiveCharacterTextSplitter(chunk_size=512)

recursive_chunks = recursive_splitter.split_text(md_result)

In [44]:
len(recursive_chunks)

2298

In [46]:
# Histogram of chunk lengths in characters (px is already imported above)

px.histogram(x=[len(chunk) for chunk in recursive_chunks]).show()
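
A histogram is easy to eyeball but awkward to compare across splitters, so a few summary numbers can help. The sketch below is not part of the original notebook: `summarize_lengths` is a hypothetical helper built only on the standard library, and the 512 default simply mirrors the `chunk_size` used above.

```python
from statistics import median


def summarize_lengths(chunks, limit=512):
    """Return basic length statistics for a list of text chunks."""
    lengths = [len(chunk) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "median": 0, "max": 0, "over_limit": 0.0}
    over = sum(1 for n in lengths if n > limit)
    return {
        "count": len(lengths),
        "min": min(lengths),
        "median": median(lengths),
        "max": max(lengths),
        # Fraction of chunks longer than the limit.
        "over_limit": over / len(lengths),
    }


# Toy input; in the notebook you would pass recursive_chunks instead.
print(summarize_lengths(["a" * 10, "b" * 20, "c" * 600]))
```

Running the same helper on each splitter's output gives numbers you can put side by side.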


In [54]:
random_index = np.random.randint(0, len(recursive_chunks))
print(recursive_chunks[random_index])

“Battle Hymn of the Republic” as a social justice hymn. MY objection is that I’d be a Christian if I could identify even a little with the text itself – the Original Sin, the patriarchal bullshit, the opposition to homosexuality, Paul’s command to women to be silent in church… these people, mostly one particular preacher at my school, have said that I secretly agreed with them, therefore I had to give up everything I love – my not-Christian lover, my atheist family, my friends who are gay or polyamorous or
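
The sample above stops mid-sentence, which is the main weakness of a purely character-based split. One rough way to quantify this is to count how many chunks end on sentence-final punctuation. This is a heuristic sketch, not something from the original notebook: `ends_cleanly` and `clean_ending_ratio` are hypothetical names, and the list of endings is an assumption that will miss cases such as closing parentheses.

```python
# Endings treated as "sentence-final"; an assumed, incomplete list.
SENTENCE_ENDINGS = (".", "!", "?", '."', '!"', '?"', ".”", "!”", "?”")


def ends_cleanly(chunk):
    """True if the chunk, ignoring trailing whitespace, ends like a sentence."""
    return chunk.rstrip().endswith(SENTENCE_ENDINGS)


def clean_ending_ratio(chunks):
    """Fraction of chunks that end on sentence-final punctuation."""
    if not chunks:
        return 0.0
    return sum(ends_cleanly(c) for c in chunks) / len(chunks)


# Toy input; in the notebook you would pass recursive_chunks instead.
print(clean_ending_ratio(["Done.", "cut off mid", "Really?"]))
```

A sentence-aware splitter should score noticeably higher on this ratio than a character-based one.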


In [55]:
from langchain_text_splitters import NLTKTextSplitter

nltk_splitter = NLTKTextSplitter(chunk_size=512)

nltk_chunks = nltk_splitter.split_text(md_result)
len(nltk_chunks)

Created a chunk of size 2192, which is longer than the specified 512
Created a chunk of size 808, which is longer than the specified 512
Created a chunk of size 533, which is longer than the specified 512
Created a chunk of size 663, which is longer than the specified 512
Created a chunk of size 613, which is longer than the specified 512
Created a chunk of size 559, which is longer than the specified 512
Created a chunk of size 858, which is longer than the specified 512
Created a chunk of size 858, which is longer than the specified 512
Created a chunk of size 913, which is longer than the specified 512
Created a chunk of size 531, which is longer than the specified 512
Created a chunk of size 841, which is longer than the specified 512
Created a chunk of size 667, which is longer than the specified 512
Created a chunk of size 537, which is longer than the specified 512
Created a chunk of size 576, which is longer than the specified 512
Created a chunk of size 811, which is longer th

2130
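
The warnings above most likely come from single sentences that are longer than `chunk_size` on their own: the splitter joins whole sentences and, as far as I can tell, does not cut inside one. To see which chunks are affected, a small helper is enough. `find_oversized` is a hypothetical name, not part of the original notebook.

```python
def find_oversized(chunks, limit=512):
    """Return (index, length) pairs for chunks longer than limit, longest first."""
    oversized = [(i, len(c)) for i, c in enumerate(chunks) if len(c) > limit]
    return sorted(oversized, key=lambda pair: pair[1], reverse=True)


# Toy input; in the notebook you would pass nltk_chunks instead.
sample = ["x" * 100, "y" * 808, "z" * 2192]
for index, length in find_oversized(sample):
    print(index, length)
```

Printing a few of the offending chunks is a quick way to check whether they are long sentences or text the sentence tokenizer could not segment, such as lists.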

In [56]:
px.histogram(x=[len(chunk) for chunk in nltk_chunks]).show()

In [61]:
random_index = np.random.randint(0, len(nltk_chunks))
print(nltk_chunks[random_index])

This face has power outside the biological realm: much of college education may fall under its purview.

I nominate Ishtar as a name, ancient goddess of lust and war but not of love.

Ishtar is not a friendly goddess, not in a Yudkowskian sense, but she seems a little better for us than Moloch.

- Elizabeth says: 

			July 30, 2014 at 4:21 pm 
+1 this name for this concept.

Also, wow, Scott, this post was extraordinary.

276.

The sentence-aware chunks read better, but some of them blow past the size limit. The second stage handles that: keep every chunk that fits, and pass only the oversized ones through the recursive splitter.


In [None]:
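def multi_stage_splitter(text, chunk_size=512):
    # Build both splitters from chunk_size so the argument is respected.
    sentence_splitter = NLTKTextSplitter(chunk_size=chunk_size)
    fallback_splitter = RecursiveCharacterTextSplitter(chunk_size=chunk_size)

    chunks = []
    # Stage 1: sentence-aware chunks.
    for chunk in sentence_splitter.split_text(text):
        if len(chunk) <= chunk_size:
            chunks.append(chunk)
        else:
            # Stage 2: re-split only the chunks that are still too long.
            chunks.extend(fallback_splitter.split_text(chunk))

    return chunks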


multi_stage_chunks = multi_stage_splitter(md_result)
len(multi_stage_chunks)

2175
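
To make the control flow easier to see without any dependencies, here is a minimal standalone sketch of the same two-stage idea. It is not the implementation used in this post. It swaps the NLTK tokenizer for a naive regex, swaps the recursive splitter for a cut at the last space before the limit, and uses no chunk overlap. All function names are hypothetical.

```python
import re


def split_sentences(text):
    """Naive sentence split on ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack(sentences, chunk_size):
    """Stage 1: greedily join whole sentences up to chunk_size characters."""
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= chunk_size or not current:
            # A lone sentence is kept even if it is too long; stage 2 handles it.
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def hard_split(chunk, chunk_size):
    """Stage 2: cut an oversized chunk at the last space within chunk_size."""
    pieces = []
    while len(chunk) > chunk_size:
        cut = chunk.rfind(" ", 0, chunk_size + 1)
        if cut <= 0:
            cut = chunk_size  # no space to break on, so cut mid-word
        pieces.append(chunk[:cut].rstrip())
        chunk = chunk[cut:].lstrip()
    if chunk:
        pieces.append(chunk)
    return pieces


def two_stage_split(text, chunk_size=512):
    """Sentence-aware packing first, hard splitting only where needed."""
    chunks = []
    for chunk in pack(split_sentences(text), chunk_size):
        if len(chunk) <= chunk_size:
            chunks.append(chunk)
        else:
            chunks.extend(hard_split(chunk, chunk_size))
    return chunks


# Two short sentences followed by one long run with no sentence breaks.
demo = "Short one. Another short one. " + "word " * 50
for piece in two_stage_split(demo, chunk_size=40):
    print(len(piece), repr(piece))
```

The design choice is the same as in the function above: most chunks keep their sentence boundaries, and only the outliers pay the cost of a cruder split.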

In [62]:
chunk_lengths = [len(chunk) for chunk in multi_stage_chunks]
px.histogram(x=chunk_lengths).show()


In [63]:
granite107 = HuggingFaceEmbeddings(
    model_name="ibm-granite/granite-embedding-107m-multilingual",
)

In [64]:
len(multi_stage_chunks), len(recursive_chunks), len(nltk_chunks)

(2175, 2298, 2130)

In [65]:
embeddings = granite107.embed_documents(multi_stage_chunks)
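
`embed_documents` returns one vector per chunk as plain lists of floats. A natural next step, which the original notebook does not show, is ranking chunks against a query vector by cosine similarity. The sketch below uses toy two-dimensional vectors and standard-library code only. `cosine_similarity` and `top_k` are hypothetical helpers, and in the notebook the query vector would presumably come from `granite107.embed_query(...)`.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for zero vectors; treat as no match
    return dot / (norm_a * norm_b)


def top_k(query_vector, vectors, k=3):
    """Return the k (index, score) pairs with the highest similarity."""
    scored = [(i, cosine_similarity(query_vector, v)) for i, v in enumerate(vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-d vectors stand in for real embeddings.
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
print(top_k([1.0, 0.1], toy_vectors, k=2))
```

The returned indices map straight back into `multi_stage_chunks`, so you can print the matching text and judge whether the chunks are self-contained enough to be useful.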