In [None]:
FILE_NAME="sample-text.txt"
with open(FILE_NAME, "r", encoding="utf-8") as file:
    text = file.read()
print(text)

# Sample 1 

#### Character splitting 

Key Concepts:

- **Chunk Size**: The maximum number of characters each chunk may contain. It can be any positive integer, such as 50, 100, or even 100,000.

- **Chunk Overlap**: The number of characters shared between consecutive chunks. Overlap reduces the chance that a sentence or idea is cut at a chunk boundary with no surrounding context, at the cost of some redundancy across chunks.
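
To make the two parameters concrete, here is a minimal pure-Python sketch of fixed-size chunking with overlap. It is not LangChain's implementation; the function name `split_by_characters` is a hypothetical helper introduced only for illustration.

```python
def split_by_characters(text, chunk_size, chunk_overlap=0):
    """Illustrative sketch (not LangChain's code): fixed-size chunks with overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
        start += step
    return chunks


print(split_by_characters("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk starts `chunk_size - chunk_overlap` characters after the previous one, so the last `chunk_overlap` characters of one chunk are repeated at the start of the next.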

In [None]:
from langchain.text_splitter import CharacterTextSplitter
# separator='' splits between every character, so chunks are cut at exactly chunk_size characters.
text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=0, separator='', strip_whitespace=False)
text_splitter.create_documents([text])

In [None]:
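# Split on the literal string 'me', then merge pieces up to chunk_size.
# Pieces longer than chunk_size are not split further by this splitter.
text_splitter = CharacterTextSplitter(chunk_size=50, chunk_overlap=10, separator='me', strip_whitespace=False)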
text_splitter.create_documents([text])

# Sample 2

## Recursive Character Text Splitter

#### Default Separators in LangChain:

- `"\n\n"`: Double new line, commonly indicating paragraph breaks.
- `"\n"`: Single new line.
- `" "`: Spaces between words.
- `""`: Individual characters.

The period (`"."`) is not in the default list of separators. A likely reason is that periods also appear in abbreviations and decimal numbers, so splitting on them can cut text in the wrong places. If you do want to split on periods, you can pass your own list through the `separators` parameter.

The splitter starts with the first separator (paragraph breaks) and checks the size of each resulting piece. If a piece is larger than `chunk_size`, it is split again using the next separator in the list. This repeats with each subsequent separator until every piece fits within `chunk_size`.
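
The fallback logic described above can be sketched in plain Python. This is a simplified illustration, not LangChain's code: it ignores overlap and separator regexes, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Simplified sketch: try each separator in order until pieces fit chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        return [text]  # nothing left to split on; keep the oversized piece

    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: cut into fixed-width character slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    chunks, current = [], ""
    for piece in text.split(sep):
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces together
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Still too large: retry with the next separator in the list.
            chunks.extend(recursive_split(piece, chunk_size, rest))
    if current:
        chunks.append(current)
    return chunks


sample = "aaa bbb ccc\n\nddddddddd eee"
print(recursive_split(sample, chunk_size=7))
# ['aaa bbb', 'ccc', 'ddddddd', 'dd', 'eee']
```

In the example, the paragraph break is tried first, then spaces, and only the nine-character word that still does not fit is cut character by character.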

In [None]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 50, chunk_overlap=0)
text_splitter.create_documents([text])

In [None]:
text_splitter = RecursiveCharacterTextSplitter(chunk_size = 550, chunk_overlap=0)
text_splitter.create_documents([text])

# Sample 3

#### Specialized Chunking

For Markdown, Python, and JavaScript files, the splitters work like the Recursive Character method but use separators tailored to each format.

##### Markdown Splitter
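Markdown files often contain headings, lists, code blocks, and links. A specialized splitter can use these elements as natural breakpoints. The list below shows breakpoints a Markdown-aware splitter could use in general; LangChain's `MarkdownTextSplitter` itself mainly splits on headings, code block fences, and horizontal rules before falling back to the generic separators. You can inspect the exact list for your installed version with `RecursiveCharacterTextSplitter.get_separators_for_language`.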

- **Headings**: Split at `#`, `##`, `###`, etc.
- **Lists**: Split at `-`, `*`, `1.`, etc.
- **Code Blocks**: Split at triple backticks (`` ``` ``)
- **Links**: Split at `[text](url)`
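
As a rough illustration of using one of these elements as a breakpoint, the sketch below splits Markdown at heading lines with a regular expression. It is not how LangChain implements it; `split_at_headings` is a hypothetical helper, and it does not try to detect headings inside code blocks.

```python
import re


def split_at_headings(markdown_text):
    """Illustrative sketch: start a new chunk at every Markdown heading line."""
    # (?m) makes ^ match at each line start; the lookahead keeps the heading
    # inside the chunk that follows it instead of consuming it.
    parts = re.split(r"(?m)^(?=#{1,6} )", markdown_text)
    return [part for part in parts if part.strip()]


sample = "# Title\nIntro text.\n\n## Section\nBody text.\n"
for chunk in split_at_headings(sample):
    print(repr(chunk))
```

Each chunk begins with its own heading, which keeps a section's title together with the text it introduces.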



In [None]:
import tiktoken
encoding = tiktoken.get_encoding("cl100k_base")

def num_tokens_from_string(string: str) -> int:
    return len(encoding.encode(string))

def print_chunks_page_content(page_content, sparse=False):
    """Print size statistics, text, and metadata for each chunk (a list of Documents).

    With sparse=True, only the first 50 characters of each chunk are printed.
    """
    print(f"Number of chunks: {len(page_content)}")
    for i, chunk in enumerate(page_content):
        print(f"Chunk {i + 1} character count: {len(chunk.page_content)} "
              f"token count: {num_tokens_from_string(chunk.page_content)}")
        if not sparse:
            print(chunk.page_content)
        else:
            print(chunk.page_content[:50])
        print("Metadata: ", chunk.metadata)
        print()

MD_FILE_NAME="sample-markdown.md"
with open(MD_FILE_NAME, "r", encoding="utf-8") as file:
    markdown_txt = file.read()
print(markdown_txt)

In [None]:
from langchain.text_splitter import MarkdownTextSplitter
splitter = MarkdownTextSplitter(chunk_size = 500, chunk_overlap=0)
md_splits=splitter.create_documents([markdown_txt])
print("Length of splits: " + str(len(md_splits)))
print_chunks_page_content(md_splits)

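`MarkdownHeaderTextSplitter` splits the text at the headings listed in `headers_to_split_on` and records the headings that apply to each chunk in its metadata. With `strip_headers=False`, the heading lines are also kept in the chunk text. This gives smaller, more manageable pieces that still carry their position in the document outline.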
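
The idea of attaching heading metadata to each chunk can be sketched without LangChain. This is a simplified illustration under stated assumptions: `split_by_headers` is a hypothetical helper, it removes heading lines from the chunk text, and it does not handle `#` lines inside code blocks.

```python
import re

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown_text):
    """Illustrative sketch: one chunk per section, with the active headings as metadata."""
    sections = []
    active = {}  # heading level -> title currently in scope
    lines = []   # body lines collected for the current section

    def flush():
        body = "\n".join(lines).strip()
        if body:
            metadata = {f"Header {lvl}": title for lvl, title in sorted(active.items())}
            sections.append({"content": body, "metadata": metadata})
        lines.clear()

    for line in markdown_text.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()  # close the previous section before changing the outline
            level = len(match.group(1))
            # A new heading ends every heading at the same or a deeper level.
            for lvl in [lvl for lvl in active if lvl >= level]:
                del active[lvl]
            active[level] = match.group(2).strip()
        else:
            lines.append(line)
    flush()
    return sections


doc = "# Title\nintro\n## Part A\nalpha\n## Part B\nbeta\n"
for section in split_by_headers(doc):
    print(section)
```

Every chunk carries the headings it sits under, so the chunk `alpha` is tagged with both `Title` and `Part A`.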

In [None]:
from langchain.text_splitter import MarkdownHeaderTextSplitter
# Markdown defines only six heading levels (# through ######).
headers_to_split_on = [
    ("#", "Header 1"),
    ("##", "Header 2"),
    ("###", "Header 3"),
    ("####", "Header 4"),
    ("#####", "Header 5"),
    ("######", "Header 6"),
]
md_text_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on, strip_headers=False)
md_header_splits = md_text_splitter.split_text(markdown_txt)

print("Length of splits: " + str(len(md_header_splits)))
print_chunks_page_content(md_header_splits)