In [3]:
from indox.splitter import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    MarkdownTextSplitter,
    AI21SemanticTextSplitter,
)

In [3]:
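# chunk_size caps the length of each chunk; chunk_overlap is how much neighbouring chunks may share.
splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)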
text = """This is a long piece of text that we want to split into smaller chunks.
It contains multiple sentences and paragraphs. We'll use this to test our text splitter.

This is a new paragraph. It should be split on the paragraph boundary first.
Then, if needed, it will be split into smaller chunks."""

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("-" * 40)


Chunk 1:
This is a long piece of text that we want to split into smaller chunks.
----------------------------------------
Chunk 2:
It contains multiple sentences and paragraphs. We'll use this to test our text splitter.
----------------------------------------
Chunk 3:
This is a new paragraph. It should be split on the paragraph boundary first.
----------------------------------------
Chunk 4:
Then, if needed, it will be split into smaller chunks.
----------------------------------------
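
The four chunks above are consistent with a recursive strategy: split on the coarsest separator first (blank lines), and fall back to finer separators only for pieces that are still longer than chunk_size. The following is a minimal sketch of that idea in plain Python. It is not indox's implementation: the helper name `recursive_split`, the separator order, and the omission of overlap and of merging small pieces are assumptions made for illustration.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ")):
    """Hypothetical helper: split on the coarsest separator, recursing on oversized pieces."""
    text = text.strip()
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to a hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    head, *rest = separators
    chunks = []
    for piece in text.split(head):
        # Pieces that already fit are kept; oversized ones are split on the next separator.
        chunks.extend(recursive_split(piece, chunk_size, tuple(rest)))
    return chunks


sample = (
    "This is a long piece of text that we want to split into smaller chunks.\n"
    "It contains multiple sentences and paragraphs. We'll use this to test our text splitter.\n"
    "\n"
    "This is a new paragraph. It should be split on the paragraph boundary first.\n"
    "Then, if needed, it will be split into smaller chunks."
)

pieces = recursive_split(sample, chunk_size=100)
for piece in pieces:
    print(len(piece), repr(piece))
```

Because this sketch never merges small pieces back together, a single line longer than chunk_size would be broken into individual words. A production splitter would normally re-join adjacent pieces up to the size limit and apply the overlap at that stage.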


In [4]:
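# Same settings, different splitter. In the output below both chunks exceed chunk_size,
# so this splitter appears to cut only on blank lines.
splitter = CharacterTextSplitter(chunk_size=100, chunk_overlap=20)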
text = """This is a long piece of text that we want to split into smaller chunks.
It contains multiple sentences and paragraphs. We'll use this to test our text splitter.

This is a new paragraph. It should be split on the paragraph boundary first.
Then, if needed, it will be split into smaller chunks."""

chunks = splitter.split_text(text)
for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("-" * 40)

Chunk 1:
This is a long piece of text that we want to split into smaller chunks.
It contains multiple sentences and paragraphs. We'll use this to test our text splitter.
----------------------------------------
Chunk 2:
This is a new paragraph. It should be split on the paragraph boundary first.
Then, if needed, it will be split into smaller chunks.
----------------------------------------


In [5]:
markdown_text = """
# Main Title

## Section 1

This is content for section 1.

## Section 2

This is content for section 2.

### Subsection 2.1

More detailed content here.

## Section 3

Final section content.
"""

splitter = MarkdownTextSplitter(chunk_size=100, chunk_overlap=20)
chunks = splitter.split_text(markdown_text)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i + 1}:")
    print(chunk)
    print("-" * 40)

Chunk 1:
# Main Title

## 
## Section 1

This is content for section 1.

##
----------------------------------------
Chunk 2:
Section 2

This is content for section 2.

### Subsection 2.1

More detailed content here.

##
----------------------------------------
Chunk 3:
Section 3

Final section content.
----------------------------------------
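
In the output above, a bare `##` ends one chunk while the heading title starts the next, so heading markers are being separated from their titles. One way to keep each heading with its section is to split just before every heading line. The sketch below does this with the standard `re` module. The helper name `split_on_headings` is hypothetical, and this is not how MarkdownTextSplitter is implemented.

```python
import re


def split_on_headings(markdown):
    """Hypothetical helper: cut before each ATX heading so headings stay with their content."""
    # (?m) lets ^ match at every line start; the lookahead leaves the heading in the next part.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown)
    return [part.strip() for part in parts if part.strip()]


doc = """
# Main Title

## Section 1

This is content for section 1.

## Section 2

This is content for section 2.
"""

sections = split_on_headings(doc)
for section in sections:
    print(section)
    print("-" * 40)
```

This sketch ignores chunk_size entirely. Sections that are too long would still need a second pass with a size-aware splitter.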


In [4]:
import os
from dotenv import load_dotenv

# Read environment variables from a local api.env file so the key stays out of the notebook.
load_dotenv('api.env')
AI21_API_KEY = os.getenv('AI21_API_KEY')
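
`os.getenv` returns `None` when the variable is not set, for example if api.env is missing or the variable is misnamed. A small guard makes that failure visible before the splitter is constructed. The helper name `require_env` is hypothetical. Whether AI21SemanticTextSplitter reads the key from the environment or needs it passed explicitly is not shown in this notebook.

```python
import os


def require_env(name):
    """Hypothetical helper: return the variable's value, or raise a clear error if it is unset."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set")
    return value
```

In the cell above, `AI21_API_KEY = require_env('AI21_API_KEY')` could stand in for the plain `os.getenv` call.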

In [5]:
TEXT = (
    "We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, "
    "legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).\n"
    "Imagine a company that employs hundreds of thousands of employees. In today's information "
    "overload age, nearly 30% of the workday is spent dealing with documents. There's no surprise "
    "here, given that some of these documents are long and convoluted on purpose (did you know that "
    "reading through all your privacy policies would take almost a quarter of a year?). Aside from "
    "inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of "
    "Employees Read Their Employment Contracts Entirely Before Signing!).\nThis is where AI-driven summarization "
    "tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, "
    "users can (ideally) quickly extract relevant information from a text. With large language models, "
    "the development of those tools is easier than ever, and you can offer your users a summary that is "
    "specifically tailored to their preferences.\nLarge language models naturally follow patterns in input "
    "(prompt), and provide coherent completion that follows the same patterns. For that, we want to feed "
    'them with several examples in the input ("few-shot prompt"), so they can follow through. '
    "The process of creating the correct prompt for your problem is called prompt engineering, "
    "and you can read more about it here."
)

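# No chunk_size is given here: the chunk boundaries come from the AI21 service rather than a length limit.
semantic_text_splitter = AI21SemanticTextSplitter()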
chunks = semantic_text_splitter.split_text(TEXT)

print(f"The text has been split into {len(chunks)} chunks.")
for chunk in chunks:
    print(chunk)
    print("====")

The text has been split into 3 chunks.
We’ve all experienced reading long, tedious, and boring pieces of text - financial reports, legal documents, or terms and conditions (though, who actually reads those terms and conditions to be honest?).

Imagine a company that employs hundreds of thousands of employees.

In today's information overload age, nearly 30% of the workday is spent dealing with documents.

There's no surprise here, given that some of these documents are long and convoluted on purpose (did you know that reading through all your privacy policies would take almost a quarter of a year?).

Aside from inefficiency, workers may simply refrain from reading some documents (for example, Only 16% of Employees Read Their Employment Contracts Entirely Before Signing!).
====
This is where AI-driven summarization tools can be helpful: instead of reading entire documents, which is tedious and time-consuming, users can (ideally) quickly extract relevant information from a text.

With la