### Fixed-size Splitting

Character splitting is the most fundamental method of breaking down your text. It involves dividing the text into chunks of a specified number of characters, regardless of the content or structure.

While this method is generally not recommended for practical applications, it serves as an excellent starting point for understanding the basics of text segmentation.

#### Key Concepts:

- **Chunk Size**: This is the number of characters you want each chunk to contain. It can be any number, such as 50, 100, or even 100,000 characters.

- **Chunk Overlap**: This is the number of characters shared by consecutive chunks. Overlap reduces the chance that a sentence or idea is cut in two at a chunk boundary, at the cost of some redundancy across chunks.
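
To make the two parameters concrete, here is a minimal plain-Python sketch of fixed-size splitting with overlap. The function name `split_fixed` is made up for this illustration, and the sketch is not LangChain's implementation; it only shows how chunk size and overlap determine where each chunk starts.

```python
def split_fixed(text, chunk_size, chunk_overlap=0):
    """Cut text into chunks of chunk_size characters, sharing chunk_overlap characters."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # distance between consecutive chunk starts
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
        start += step
    return chunks


no_overlap = split_fixed("abcdefghij", chunk_size=4)
with_overlap = split_fixed("abcdefghij", chunk_size=4, chunk_overlap=2)
print(no_overlap)
print(with_overlap)
```

With an overlap of 2, each chunk repeats the last two characters of the previous one, which is the redundancy described above.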

In [None]:
# Read the text from sample-text.txt and print it to the console
FILE_NAME = "sample-text.txt"
with open(FILE_NAME, "r", encoding="utf-8") as file:
    text = file.read()
print(text)

In [None]:
from langchain.text_splitter import CharacterTextSplitter

# separator='' splits between individual characters, so chunks are cut purely by length
text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=0, separator='', strip_whitespace=False)
text_splitter.create_documents([text])

In [None]:
# Same split, but consecutive chunks now share 10 characters
text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=10, separator='', strip_whitespace=False)
text_splitter.create_documents([text])

In [None]:
# Split on the literal substring 'me' instead of between every character
text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=0, separator='me', strip_whitespace=False)
text_splitter.create_documents([text])

### Recursive Character Text Splitting

With Character Text Splitter we split text by simply dividing the document based on a fixed number of characters. While straightforward, this method doesn't account for the document's inherent structure.

The Recursive Character Text Splitter addresses this limitation by allowing us to specify a series of separators to intelligently split our documents. This method takes into account various structural elements, resulting in a more context-aware split.

#### Default Separators in LangChain

Let's examine the default separators used in LangChain:

- `"\n\n"`: Double new line, commonly indicating paragraph breaks.
- `"\n"`: Single new line.
- `" "`: Spaces between words.
- `""`: Individual characters.

Period (`"."`) is not included in the default list of separators. 

#### Why Choose Recursive Character Text Splitter?

The Recursive Character Text Splitter is a versatile tool, often my go-to when prototyping a quick application. Its flexibility in handling various separators makes it an excellent first choice if you're unsure which splitter to use.

By understanding and leveraging the structure of your document, this splitter can produce more meaningful and contextually appropriate splits, enhancing the overall processing and analysis of your text data.

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 50, chunk_overlap=0)
text_splitter.create_documents([text])

After splitting the text into paragraphs, the process evaluates the size of each chunk. If a chunk is too large, it will attempt to divide it using the next available separator. Should the chunk remain too large, the process will continue to the subsequent separator, repeating this until an appropriate size is achieved.
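
The fallback from one separator to the next can be sketched in plain Python. This is a simplified illustration under stated assumptions, not LangChain's implementation: the function name `recursive_split` is made up here, overlap is ignored, and small pieces are simply merged back together until they would exceed the chunk size.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split on the first separator; recurse with the next one for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on; return the oversized piece as-is

    sep, remaining = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: try the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddd eee fff ggg"
sample_chunks = recursive_split(sample, chunk_size=11)
print(sample_chunks)
```

The first paragraph fits in one chunk, while the second is too long and falls back to splitting at spaces.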

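With a chunk size of only 50 characters, most paragraphs are too long to fit in a single chunk, so the splitter keeps falling back to finer separators. Given the length of this text, a larger chunk size lets more of the splits land on paragraph breaks.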

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 500, chunk_overlap=0)
text_splitter.create_documents([text])

### Document-Specific Splitting

It's important to address document types that go beyond simple text files. What if your documents contain images, PDFs, or code snippets? Our initial two levels of chunking strategies may not be effective for these diverse formats, so we need to adopt a different approach.

This level focuses on tailoring your chunking strategy to fit various data formats. Let's explore several examples to illustrate this in practice:

#### Markdown, Python, and JavaScript Splitters

For Markdown, Python, and JavaScript files, the splitters will resemble the Recursive Character method but will use different separators tailored to each format.

##### Markdown Splitter
Markdown files often contain headings, lists, code blocks, and links. A specialized splitter can use these elements as natural breakpoints.

- **Headings**: Split at `#`, `##`, `###`, etc.
- **Lists**: Split at `-`, `*`, `1.`, etc.
- **Code Blocks**: Split at triple backticks (```` ``` ````)
- **Links**: Split at `[text](url)`
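
As a rough illustration of using headings as breakpoints, the sketch below splits a Markdown string before every heading line with a regular expression. The function name `split_markdown_by_heading` and the sample text are made up for this example, and the sketch does not account for `#` lines inside fenced code blocks.

```python
import re


def split_markdown_by_heading(markdown_text):
    """Return one section per heading, each starting with its heading line."""
    # Zero-width split before any line that starts with 1-6 '#' followed by a space.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    return [part.strip() for part in parts if part.strip()]


sample_md = (
    "# Title\n\nIntro text.\n\n"
    "## Setup\n\n- step one\n- step two\n\n"
    "## Usage\n\nRun it.\n"
)
sections = split_markdown_by_heading(sample_md)
for section in sections:
    print(repr(section))
```

Each section keeps its heading, so a chunk still carries the context of where it came from in the document.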

##### Python Splitter
Python files have distinct structural elements such as function definitions, class definitions, and comments.

- **Function Definitions**: Split at `def`
- **Class Definitions**: Split at `class`
- **Comments**: Split at `#`
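
The same idea applied to Python source might look like the following sketch, which splits only at top-level `def` and `class` statements so that methods stay together with their class. The function name `split_python_top_level` and the sample source are assumptions made for this illustration.

```python
import re


def split_python_top_level(source):
    """Split Python source before each unindented def or class statement."""
    # '^' with the multiline flag matches at line starts; indented defs are not matched.
    parts = re.split(r"(?m)^(?=(?:def|class)\s)", source)
    return [part for part in parts if part.strip()]


sample_source = (
    "import os\n\n"
    "def helper():\n    return 1\n\n"
    "class Worker:\n    def run(self):\n        return helper()\n"
)
code_chunks = split_python_top_level(sample_source)
for chunk in code_chunks:
    print(repr(chunk))
```

Because no characters are discarded, concatenating the chunks reproduces the original source.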

##### JavaScript Splitter
JavaScript files also have unique structural features like function declarations, import statements, and comments.

- **Function Declarations**: Split at `function`
- **Import Statements**: Split at `import`
- **Comments**: Split at `//` for single-line comments and `/*...*/` for multi-line comments

#### Handling Other Formats

##### Images
Images can be split based on metadata or by grouping related images together. However, splitting images isn't usually necessary unless you're dealing with image datasets.

##### PDFs
PDFs are complex documents that can contain text, images, and vector graphics. A PDF splitter can use the following strategies:

- **Pages**: Split by individual pages
- **Headings**: Use text headings to define sections
- **Paragraphs**: Split by paragraphs for more granularity

##### Code Snippets
Code snippets can be language-specific, so the splitting strategy should account for the syntax and structure of the particular language.

- **Blocks**: Split at logical code blocks or functions
- **Comments**: Use comments as natural breakpoints
- **Imports/Includes**: Split at import or include statements

By customizing your chunking strategy to fit the specific format of your documents, you can better manage and process diverse types of data. This tailored approach ensures that you handle different data formats effectively, maintaining the integrity and meaning of the original content.

In [None]:
# Read the sample Markdown file and print it to the console
MD_FILE_NAME = "sample-markdown.md"
with open(MD_FILE_NAME, "r", encoding="utf-8") as file:
    markdown_txt = file.read()
print(markdown_txt)

In [None]:
from langchain.text_splitter import MarkdownTextSplitter
splitter = MarkdownTextSplitter(chunk_size = 100, chunk_overlap=0)
splitter.create_documents([markdown_txt])

### Semantic Chunking

Semantic Chunking involves dividing text based on semantic similarity. This technique helps in creating more coherent and contextually relevant text segments.

#### Overview
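The process begins by splitting the text into individual sentences. Each sentence is then grouped with its neighbors into a window of three sentences, and each window is embedded. Finally, the distance between consecutive windows in the embedding space is measured: windows that are close together are merged into the same chunk, and a new chunk starts where the distance is large, so the final chunks are semantically cohesive.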
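
The idea can be sketched without calling any embedding service. In the sketch below, `embed` is a toy bag-of-words stand-in for a real embedding model, and the names `semantic_chunks`, `threshold`, and `buffer_size` are assumptions made for this illustration. A fixed distance threshold is used for simplicity, whereas a real chunker may choose its threshold differently.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def semantic_chunks(text, threshold, buffer_size=1):
    """Start a new chunk wherever consecutive sentence windows are far apart."""
    sentences = [s for s in re.split(r"(?<=[.?!])\s+", text.strip()) if s]
    if not sentences:
        return []
    # Combine each sentence with buffer_size neighbors on each side before embedding.
    windows = [
        " ".join(sentences[max(0, i - buffer_size): i + buffer_size + 1])
        for i in range(len(sentences))
    ]
    vectors = [embed(window) for window in windows]

    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        if cosine_distance(vectors[i - 1], vectors[i]) > threshold:
            chunks.append(" ".join(current))
            current = []
        current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks


toy_text = "Cats purr. Cats sleep. Stocks fell. Stocks rose."
toy_chunks = semantic_chunks(toy_text, threshold=0.2)
print(toy_chunks)
```

Because neighboring windows share sentences, their distances are small overall; the break lands where the distance peaks, which here is the change of topic.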

In [None]:
%pip install --quiet langchain_experimental langchain_openai python-dotenv

In [None]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import AzureOpenAIEmbeddings
#from langchain_openai.embeddings import OpenAIEmbeddings
import os
from dotenv import load_dotenv


load_dotenv()

FILE_NAME = "samp1.txt"
with open(FILE_NAME, "r", encoding="utf-8") as file:
    text = file.read()

azure_openai_endpoint = os.environ["AZURE_OPENAI_ENDPOINT"]
azure_openai_key = os.environ["AZURE_OPENAI_API_KEY"] or None  # treat an empty key as missing
azure_openai_embedding_deployment = os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME"]
embedding_model_name = os.environ["AZURE_OPENAI_EMBEDDING_MODEL"]
azure_openai_api_version = os.environ["OPENAI_API_VERSION"]
openai_api_type = os.environ["OPENAI_API_TYPE"]



embeddings = AzureOpenAIEmbeddings(
    model=embedding_model_name,
    deployment=azure_openai_embedding_deployment,
    openai_api_type = openai_api_type,
    openai_api_version = azure_openai_api_version,
    azure_endpoint = azure_openai_endpoint,
    openai_api_key = azure_openai_key,
    embedding_ctx_length=8191,
    chunk_size=1000,
    max_retries=6
)

# Pass the configured embeddings object instead of creating a new default one
text_splitter = SemanticChunker(embeddings)
docs = text_splitter.create_documents([text])
print(docs[0].page_content)

In [None]:
print(docs[1].page_content)