### **LIMITATIONS OF THE SIMPLE SPLITTING TECHNIQUES**
1. Splits are naive (not context-aware)
    - They ignore the context of the surrounding text
2. Splits are made using characters rather than tokens
    - Models process tokens, not characters
    - A chunk sized in characters risks exceeding the model's context window
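
To make the second limitation concrete, here is a minimal sketch that cuts a sentence every 20 characters and then counts the tokens in each piece. The `toy_tokenize` helper is a hypothetical stand-in (a regex that separates words and punctuation), not a real model tokenizer, and `split_by_characters` is an illustrative helper, not a library function. Real tokenizers give different counts, but the mismatch between character length and token count is of the same kind.

```python
import re

def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: each word or punctuation mark is one token."""
    return re.findall(r"\w+|[^\w\s]", text)

def split_by_characters(text, chunk_size):
    """Naive splitter: cut every `chunk_size` characters, ignoring word boundaries."""
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

text = "Mary had a little lamb, its fleece was white as snow"
char_chunks = split_by_characters(text, chunk_size=20)
token_counts = [len(toy_tokenize(chunk)) for chunk in char_chunks]

for chunk, n_tokens in zip(char_chunks, token_counts):
    # Same character budget per chunk, but the token count varies,
    # and words such as "lamb" and "white" are cut in half.
    print(repr(chunk), "->", len(chunk), "chars,", n_tokens, "tokens")
```

A character budget says nothing reliable about how many tokens the model will see, which is why the next cell sizes chunks in tokens instead.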

In [1]:
import tiktoken
from langchain_text_splitters import TokenTextSplitter
example_string = "Mary had a little lamb, it's fleece was white as snow"

# Look up the encoding used by the target model so chunk sizes match its tokenizer
encoding = tiktoken.encoding_for_model("gpt-4o-mini")

splitter = TokenTextSplitter(
    encoding_name=encoding.name,
    chunk_size=10,     # maximum number of tokens per chunk
    chunk_overlap=2,   # tokens shared between consecutive chunks
)
chunks = splitter.split_text(example_string)

for i, chunk in enumerate(chunks, start=1):
    print(f"chunk {i}:\n {chunk}\n")


chunk 1:
 Mary had a little lamb, it's fleece was white

chunk 2:
  was white as snow
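
The overlap visible between the two chunks above (`was white`) comes from sliding a fixed-size window over the token sequence. Below is a minimal sketch of that windowing, assuming the step between windows is `chunk_size - chunk_overlap`. The word list is a hypothetical stand-in for real token ids, and `window_tokens` is an illustrative helper rather than the library's own function.

```python
def window_tokens(tokens, chunk_size, chunk_overlap):
    """Slide a window of `chunk_size` tokens, stepping by chunk_size - chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the sequence
        start += step
    return windows

# Hypothetical tokens: one word per token, purely for illustration.
tokens = "Mary had a little lamb its fleece was white as snow".split()
windows = window_tokens(tokens, chunk_size=6, chunk_overlap=2)

for i, window in enumerate(windows, start=1):
    print(f"chunk {i}: {' '.join(window)}")
```

Each window repeats the last `chunk_overlap` tokens of the previous one, so text that straddles a boundary appears in both chunks.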



**Semantic Splitting**
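Semantic splitting places chunk boundaries where the topic changes. The two sentences below are unrelated, so a semantic splitter should separate them:

<div style='color:red'>RAG applications have numerous use cases but are most frequently deployed in chatbots.</div> <div style='color:blue'>Domestic dogs are descended from an extinct population of Pleistocene wolves over 14,000 years ago.</div>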
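
The idea of splitting on topic shifts can be sketched without any model: embed each sentence, measure the distance between neighbouring sentences, and start a new chunk where that distance is large. In the sketch below, `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, `semantic_split` is an illustrative helper, and the fixed `threshold` is an arbitrary value chosen by hand for this toy data. A real semantic splitter would use learned embeddings and derive its threshold from the distribution of distances.

```python
import math
import re
from collections import Counter

def toy_embed(sentence):
    """Hypothetical embedding: lowercase word counts (a real model returns dense vectors)."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))

def cosine_distance(a, b):
    """1 - cosine similarity between two sparse count vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return 1.0 if norm == 0 else 1.0 - dot / norm

def semantic_split(sentences, threshold):
    """Start a new chunk whenever neighbouring sentences are farther apart than `threshold`."""
    if not sentences:
        return []
    chunks = [[sentences[0]]]
    for prev, curr in zip(sentences, sentences[1:]):
        if cosine_distance(toy_embed(prev), toy_embed(curr)) > threshold:
            chunks.append([curr])      # topic shift: open a new chunk
        else:
            chunks[-1].append(curr)    # same topic: extend the current chunk
    return [" ".join(chunk) for chunk in chunks]

sentences = [
    "RAG applications have numerous use cases.",
    "RAG applications are most frequently deployed in chatbots.",
    "Domestic dogs are descended from Pleistocene wolves.",
    "Pleistocene wolves are an extinct population.",
]
chunks = semantic_split(sentences, threshold=0.8)  # hand-picked threshold (assumption)
for chunk in chunks:
    print(chunk)
```

With this toy data the only boundary falls between the RAG sentences and the dog sentences, because that is the one place where neighbouring sentences share almost no vocabulary.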

In [2]:
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

In [3]:
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    # How the breakpoint threshold is derived from the distances between
    # consecutive sentence embeddings
    breakpoint_threshold_type="gradient",
    # Cutoff used to decide where to split, so that semantically related
    # sentences stay in the same chunk
    breakpoint_threshold_amount=0.8,
)

In [4]:
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("C:/Users/Acer/OneDrive/Documents/books/python/python-crash-course.pdf")

In [5]:
data = loader.load()  # one Document per PDF page

In [6]:
chunks = semantic_splitter.split_documents(data)
print(chunks[0])  # inspect the first chunk and its metadata

page_content='A HANDS-ON , PROJECT-BASED
INTRODUCTION TO PROGRAMMING
ERIC MATTHES
P Y THON
C R ASH COURSE
P Y THON
C R ASH COURSE
SHELVE IN:
PROGRAMMING LANGUAGES/
PYTHON
$39.95 ($45.95 CDN)
FAST!' metadata={'producer': 'Adobe PDF Library 9.9', 'creator': 'Adobe InDesign CS5.5 (7.5.3)', 'creationdate': '2015-10-26T15:01:49-07:00', 'moddate': '2015-10-27T15:56:06-07:00', 'trapped': '/False', 'source': 'C:/Users/Acer/OneDrive/Documents/books/python/python-crash-course.pdf', 'total_pages': 562, 'page': 0, 'page_label': 'i'}
