#Large Language Model Chunkizer
##Introduction
In this notebook, I demonstrate how __LLMChunkizerLib__ uses a Large Language Model (LLM) to split text (even from large documents) into coherent chunks, each of which covers a single concept or idea.

##Import Library
This notebook leverages LangChain and the OpenAI model deployed on Azure.

First, we import the necessary libraries: os, dotenv, and the Azure OpenAI chat model from langchain_openai.

Next, we import my llm_chunkizer class, which provides several static methods for splitting documents.

In [1]:
import os
from langchain_openai.chat_models.azure import AzureChatOpenAI
from dotenv import load_dotenv
from LLMChunkizerLib.chunkizer import llm_chunkizer

##Setting variables
Next, we load the environment variables required to use Azure OpenAI.

In [2]:
load_dotenv()
azure_deployment = os.getenv("AZURE_DEPLOYMENT")
temperature = float(os.getenv("TEMPERATURE", "0"))  # default to 0 so float() never receives None
api_key = os.getenv("AZURE_OPENAI_API_KEY")
endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
api_version = os.getenv("API_VERSION")
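
Because os.getenv returns None for any variable that is not set, a missing entry in the .env file would only surface later as a confusing error from the client. A minimal sketch of an early check follows; the helper name require_env is hypothetical and is not part of LLMChunkizerLib or LangChain.

```python
import os


def require_env(names, env=None):
    """Return a dict of the requested variables, raising if any are missing.

    Hypothetical helper for illustration; not part of LLMChunkizerLib.
    """
    env = os.environ if env is None else env
    # Treat unset and empty variables the same way.
    missing = [name for name in names if not env.get(name)]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return {name: env[name] for name in names}
```

Calling require_env(["AZURE_DEPLOYMENT", "AZURE_OPENAI_API_KEY", "AZURE_OPENAI_ENDPOINT", "API_VERSION"]) right after load_dotenv() would stop the notebook with a clear message if the configuration is incomplete.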

##Define database
In a real-world scenario, I obtain paragraphs from a 30-page Word document. To keep this example simple, I instead create a list of four short passages from __Around the World in Eighty Days__.

In [3]:
documents = [ 
     """On October 2, 1872, Phileas Fogg, an English gentleman, left London for an extraordinary journey. 
	He had wagered that he could circumnavigate the globe in just eighty days. 
	Fogg was a man of strict habits and a very methodical life; everything was planned down to the smallest detail, and nothing was left to chance.
	He departed London on a train to Dover, then crossed the Channel by ship. His journey took him through many countries, 
	including France, India, Japan, and America. At each stop, he encountered various people and faced countless adventures, but his determination never wavered.""",

    """However, time was his enemy, and any delay risked losing the bet. With the help of his faithful servant Passepartout, Fogg had to face 
	unexpected obstacles and dangerous situations.""",
	"""Yet, each time, his cunning and indomitable spirit guided him to victory, while the world watched in disbelief.""",

    """With one final effort, Fogg and Passepartout reached London just in time to prove that they had completed their journey in less than eighty days. 
	This extraordinary adventurer not only won the bet but also discovered that the true treasure was the friendship and experiences he had accumulated along the way."""
]

##Initiate LLM
Now I create the LLM client for a GPT-4o model deployed on Azure OpenAI.

In [4]:
# Initialize the LLM
llm = AzureChatOpenAI(
    api_key=api_key,
    azure_endpoint=endpoint,
    azure_deployment=azure_deployment,
    api_version=api_version,
    temperature=temperature,
)

##Block creation
Next, I transform the paragraphs into blocks of text, each with a maximum size of 200 tokens. The block size here is arbitrary; in a real-world scenario, I typically use block sizes between 3,000 and 5,000 tokens.

In [7]:
refined_blocks = llm_chunkizer.split_document_into_blocks(documents, 200)

Print blocks

In [None]:
for idx, block in enumerate(refined_blocks):
    if block.strip():  # skip empty blocks
        print(f"{idx}: {block}")
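
To make the idea of a token budget concrete, here is a minimal sketch of how paragraphs could be packed into blocks. It is not the implementation of split_document_into_blocks: the function names are hypothetical, and tokens are approximated by whitespace-separated words, whereas a real implementation would most likely rely on a model tokenizer.

```python
def count_tokens(text):
    """Rough token count: whitespace-separated words (an approximation)."""
    return len(text.split())


def pack_into_blocks(paragraphs, max_tokens):
    """Greedily group consecutive paragraphs into blocks of at most max_tokens.

    A paragraph that exceeds the budget on its own becomes its own block.
    Hypothetical helper for illustration; not part of LLMChunkizerLib.
    """
    blocks, current, current_size = [], [], 0
    for paragraph in paragraphs:
        size = count_tokens(paragraph)
        # Close the current block when this paragraph would exceed the budget.
        if current and current_size + size > max_tokens:
            blocks.append("\n".join(current))
            current, current_size = [], 0
        current.append(paragraph)
        current_size += size
    if current:
        blocks.append("\n".join(current))
    return blocks
```

Under this greedy scheme a small budget yields many short blocks and a large budget yields few long ones, which is the trade-off explored in the rest of the notebook.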

##Chunkize blocks

In the chunk_text_with_llm function, I split each block into chunks and handle a potential problem: adjacent paragraphs that convey the same idea may have ended up in different blocks.
This matters because splitting related information loses context and makes the content harder for the model to interpret.
To mitigate this, I take the last two chunks (if they exist) produced from the current block and prepend them to the next block before analyzing it.
This keeps related concepts together and preserves their context. The process is repeated for all remaining blocks.
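
The carry-over step described above can be sketched as follows. This is an illustration under assumptions, not the code of chunk_text_with_llm: the function names are hypothetical, and the LLM call is replaced by a toy stand-in that merges consecutive sentences starting with the same letter.

```python
def chunk_blocks_with_carry(blocks, chunk_fn, carry=2):
    """Chunk each block, re-examining the last `carry` chunks with the next block."""
    final_chunks = []
    carried = []  # chunks held back from the previous block
    for block in blocks:
        # Prepend the held-back chunks so related text is analyzed together.
        text = " ".join(carried + [block])
        chunks = chunk_fn(text)
        if carry > 0:
            # Finalize all but the last `carry` chunks; hold the rest back.
            final_chunks.extend(chunks[:-carry])
            carried = chunks[-carry:]
        else:
            final_chunks.extend(chunks)
            carried = []
    # Nothing follows the last block, so the held-back chunks are final.
    final_chunks.extend(carried)
    return final_chunks


def group_by_first_letter(text):
    """Toy stand-in for the LLM: merge consecutive sentences sharing a first letter."""
    sentences = [s.strip() + "." for s in text.split(".") if s.strip()]
    groups = []
    for sentence in sentences:
        if groups and groups[-1][0] == sentence[0]:
            groups[-1] += " " + sentence
        else:
            groups.append(sentence)
    return groups


blocks = ["A1. B1.", "B2. C1."]
print(chunk_blocks_with_carry(blocks, group_by_first_letter))
# ['A1.', 'B1. B2.', 'C1.']
```

In this toy run, "B1." and "B2." start in different blocks but end up in the same chunk because "B1." is carried into the second block. With carry=0 they would stay separate.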

In [None]:
final_chunks = llm_chunkizer.chunk_text_with_llm(llm, refined_blocks)
for idx, chunk in enumerate(final_chunks):
    if chunk.strip():  # skip empty chunks
        print(f"{idx}: {chunk}")

You can see how the database has been split into six distinct chunks.

0: On October 2, 1872, Phileas Fogg, an English gentleman, left London for an extraordinary journey. He had wagered that he could circumnavigate the globe in just eighty days. Fogg was a man of strict habits and a very methodical life; everything was planned down to the smallest detail, and nothing was left to chance. 
1:  He departed London on a train to Dover, then crossed the Channel by ship. His journey took him through many countries, including France, India, Japan, and America. At each stop, he encountered various people and faced countless adventures, but his determination never wavered. 
2: However, time was his enemy, and any delay risked losing the bet. With the help of his faithful servant Passepartout, Fogg had to face unexpected obstacles and dangerous situations. 
3: Yet, each time, his cunning and indomitable spirit guided him to victory, while the world watched in disbelief.
4: With one final effort, Fogg and Passepartout reached London just in time to prove that they had completed their journey in less than eighty days.
5: This extraordinary adventurer not only won the bet but also discovered that the true treasure was the friendship and experiences he had accumulated along the way.

Let's see what happens when I split the original database into larger blocks.

In [None]:
refined_blocks = llm_chunkizer.split_document_into_blocks(documents, 1000)

final_chunks = llm_chunkizer.chunk_text_with_llm(llm, refined_blocks)
for idx, chunk in enumerate(final_chunks):
    if chunk.strip():  # skip empty chunks
        print(f"{idx}: {chunk}")

With a larger block size, the system generates 4 chunks instead of 6. This is expected: the prompt that divides the text into chunks sees a larger portion of text at once, so it can group more text under a single concept and produce fewer chunks.

0: On October 2, 1872, Phileas Fogg, an English gentleman, left London for an extraordinary journey. He had wagered that he could circumnavigate the globe in just eighty days. Fogg was a man of strict habits and a very methodical life; everything was planned down to the smallest detail, and nothing was left to chance. 
1: He departed London on a train to Dover, then crossed the Channel by ship. His journey took him through many countries, including France, India, Japan, and America. At each stop, he encountered various people and faced countless adventures, but his determination never wavered.
2: However, time was his enemy, and any delay risked losing the bet. With the help of his faithful servant Passepartout, Fogg had to face unexpected obstacles and dangerous situations. Yet, each time, his cunning and indomitable spirit guided him to victory, while the world watched in disbelief.
3: With one final effort, Fogg and Passepartout reached London just in time to prove that they had completed their journey in less than eighty days. This extraordinary adventurer not only won the bet but also discovered that the true treasure was the friendship and experiences he had accumulated along the way.


##Final Thoughts
Ultimately, it's important to run the chunkizer several times, varying the block size each time, and to review the results of each run to determine which setting best suits the desired outcome.
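
One way to organize such a comparison is a small loop over candidate block sizes. The sketch below is hypothetical: compare_block_sizes is not part of LLMChunkizerLib, and the splitting and chunking steps are passed in as callables so the helper makes no assumptions about the library beyond the two calls already used in this notebook.

```python
def compare_block_sizes(documents, block_sizes, split_fn, chunk_fn):
    """Run the split-then-chunk pipeline once per block size.

    Returns a dict mapping each block size to its list of non-empty chunks.
    Hypothetical helper for illustration; not part of LLMChunkizerLib.
    """
    results = {}
    for size in block_sizes:
        blocks = split_fn(documents, size)
        # Drop empty chunks, as the printing loops in this notebook do.
        results[size] = [chunk for chunk in chunk_fn(blocks) if chunk.strip()]
    return results


# Possible use in this notebook, reusing the calls shown earlier:
# results = compare_block_sizes(
#     documents,
#     [200, 1000],
#     llm_chunkizer.split_document_into_blocks,
#     lambda blocks: llm_chunkizer.chunk_text_with_llm(llm, blocks),
# )
# for size, chunks in results.items():
#     print(f"block size {size}: {len(chunks)} chunks")
```

Keeping the results side by side makes it easier to judge which block size produces chunks that match the concepts in the source text.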