#### Text Splitter Practice

In [23]:
from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter, Language
from langchain_community.document_loaders import TextLoader
from langchain_experimental.text_splitter import SemanticChunker
from langchain_huggingface import HuggingFaceEmbeddings

In [4]:
# Sample text file from the DocLoaders practice
file_path = "../DocLoadersPractice/files/sample.txt"
text_loader = TextLoader(file_path)
documents = text_loader.load()
for doc in documents:
    print(doc.metadata)
    print(doc.page_content)

{'source': '../DocLoadersPractice/files/sample.txt'}
Hi this is sample text for LangChain DocLoader practice.
It contains multiple lines.
The HuggingFacePipeline in LangChain is a wrapper class that allows you to use models from the Hugging Face transformers library, run them locally, and integrate them into LangChain applications. 
Overview
Purpose: The class wraps the Hugging Face pipeline function, enabling seamless use of a wide range of open-source models (like GPT-2 or T5) within the LangChain framework.
Local Execution: It runs models locally on your computer, making it suitable for privacy-sensitive applications or environments where an internet connection is unavailable after the initial download.
Supported Tasks: It primarily supports text-centric tasks such as text generation, text-to-text generation, and summarization. 


Length-Based Splitting

In [7]:
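# Split only on newlines, then merge neighbouring lines up to chunk_size characters.
# A single line longer than chunk_size cannot be cut further, hence the warnings in the output.
lb_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
    separator="\n",
)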
lb_result = lb_splitter.split_documents(documents)
for i, doc in enumerate(lb_result):
    print(f"\n--- Document {i+1} ---")
    print(doc.page_content)

Created a chunk of size 196, which is longer than the specified 100
Created a chunk of size 171, which is longer than the specified 100
Created a chunk of size 199, which is longer than the specified 100



--- Document 1 ---
Hi this is sample text for LangChain DocLoader practice.
It contains multiple lines.

--- Document 2 ---
The HuggingFacePipeline in LangChain is a wrapper class that allows you to use models from the Hugging Face transformers library, run them locally, and integrate them into LangChain applications.

--- Document 3 ---
Overview

--- Document 4 ---
Purpose: The class wraps the Hugging Face pipeline function, enabling seamless use of a wide range of open-source models (like GPT-2 or T5) within the LangChain framework.

--- Document 5 ---
Local Execution: It runs models locally on your computer, making it suitable for privacy-sensitive applications or environments where an internet connection is unavailable after the initial download.

--- Document 6 ---
Supported Tasks: It primarily supports text-centric tasks such as text generation, text-to-text generation, and summarization.
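
The "Created a chunk of size ..." warnings appear because the splitter only cuts at the separator, so a line longer than chunk_size has nowhere to be cut. The following is a minimal pure-Python sketch of that split-then-merge idea. The function name `split_and_merge` and the sample string are made up for illustration, and this is not LangChain's actual implementation.

```python
def split_and_merge(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on `separator`, then greedily merge neighbours up to `chunk_size`."""
    pieces = [p for p in text.split(separator) if p.strip()]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece longer than chunk_size is kept whole:
            # there is no finer separator to cut it with.
            current = piece
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample: two short lines, one 150-character line, one short line.
sample = "short line one\nshort line two\n" + "x" * 150 + "\ntail"
chunks = split_and_merge(sample, "\n", 100)
lengths = [len(c) for c in chunks]
oversized = [n for n in lengths if n > 100]
print(lengths, oversized)
```

The two short lines are merged into one chunk, while the 150-character line survives as a single oversized chunk, which mirrors the behaviour seen in the output above.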


In [8]:
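# An empty separator splits the text into single characters, so chunks are cut
# at chunk_size regardless of word or line boundaries.
lb_splitter = CharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
    separator="",
)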
lb_result = lb_splitter.split_documents(documents)
for i, doc in enumerate(lb_result):
    print(f"\n--- Document {i+1} ---")
    print(doc.page_content)


--- Document 1 ---
Hi this is sample text for LangChain DocLoader practice.
It contains multiple lines.
The HuggingFace

--- Document 2 ---
Pipeline in LangChain is a wrapper class that allows you to use models from the Hugging Face transfo

--- Document 3 ---
rmers library, run them locally, and integrate them into LangChain applications. 
Overview
Purpose:

--- Document 4 ---
The class wraps the Hugging Face pipeline function, enabling seamless use of a wide range of open-so

--- Document 5 ---
urce models (like GPT-2 or T5) within the LangChain framework.
Local Execution: It runs models local

--- Document 6 ---
ly on your computer, making it suitable for privacy-sensitive applications or environments where an

--- Document 7 ---
internet connection is unavailable after the initial download.
Supported Tasks: It primarily support

--- Document 8 ---
s text-centric tasks such as text generation, text-to-text generation, and summarization.
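
With an empty separator the result is essentially fixed-width slicing, which is why words such as "transformers" are cut in half. A rough sketch of that slicing, including what a non-zero chunk_overlap would do, is shown here. The helper `fixed_windows` is hypothetical, and unlike the output above it does not trim whitespace at chunk edges.

```python
def fixed_windows(text: str, chunk_size: int, chunk_overlap: int = 0) -> list[str]:
    """Cut `text` into windows of `chunk_size`, each sharing `chunk_overlap` chars with the previous one."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]


alphabet = "abcdefghijklmnopqrstuvwxyz"
no_overlap = fixed_windows(alphabet, 10)
with_overlap = fixed_windows(alphabet, 10, 3)
print(no_overlap)
print(with_overlap)
```

With an overlap of 3, each window repeats the last three characters of the previous one, so context at a cut point is not lost entirely.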


Text-Structure-Based Splitting

In [12]:
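# Default separators are tried in order: "\n\n", "\n", " ", "".
# A finer separator is used only when a piece is still longer than chunk_size.
ts_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=0,
)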

ts_result = ts_splitter.split_documents(documents)
for i, doc in enumerate(ts_result):
    print(f"\n--- Document {i+1} ---")
    print(doc.page_content)


--- Document 1 ---
Hi this is sample text for LangChain DocLoader practice.
It contains multiple lines.

--- Document 2 ---
The HuggingFacePipeline in LangChain is a wrapper class that allows you to use models from the Hugging Face transformers library, run them locally, and integrate them into LangChain applications.

--- Document 3 ---
Overview
Purpose: The class wraps the Hugging Face pipeline function, enabling seamless use of a wide range of open-source models (like GPT-2 or T5) within the LangChain framework.

--- Document 4 ---
Local Execution: It runs models locally on your computer, making it suitable for privacy-sensitive applications or environments where an internet connection is unavailable after the initial download.

--- Document 5 ---
Supported Tasks: It primarily supports text-centric tasks such as text generation, text-to-text generation, and summarization.


In [13]:
# Same splitter with a smaller chunk_size: long lines now fall back to splitting on spaces.
ts_splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=0,
)

ts_result = ts_splitter.split_documents(documents)
for i, doc in enumerate(ts_result):
    print(f"\n--- Document {i+1} ---")
    print(doc.page_content)


--- Document 1 ---
Hi this is sample text for LangChain DocLoader practice.
It contains multiple lines.

--- Document 2 ---
The HuggingFacePipeline in LangChain is a wrapper class that allows you to use models from the

--- Document 3 ---
Hugging Face transformers library, run them locally, and integrate them into LangChain

--- Document 4 ---
applications.

--- Document 5 ---
Overview

--- Document 6 ---
Purpose: The class wraps the Hugging Face pipeline function, enabling seamless use of a wide range

--- Document 7 ---
of open-source models (like GPT-2 or T5) within the LangChain framework.

--- Document 8 ---
Local Execution: It runs models locally on your computer, making it suitable for privacy-sensitive

--- Document 9 ---
applications or environments where an internet connection is unavailable after the initial

--- Document 10 ---
download.

--- Document 11 ---
Supported Tasks: It primarily supports text-centric tasks such as text generation, text-to-text

--- Document 12 ---
generation, and summarization.
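
The recursive behaviour seen above (newlines first, then spaces, and only then a hard cut) can be sketched in plain Python. This is a simplified illustration under the assumption of the default separator order `"\n\n"`, `"\n"`, `" "`, `""`. The function `recursive_split` is a made-up name and omits details of the real splitter, such as overlap handling and whitespace stripping.

```python
def recursive_split(text: str, chunk_size: int,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split on the coarsest separator; recurse with finer ones for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        if len(piece) > chunk_size:
            # Flush what has been merged so far, then split the big piece more finely.
            if current:
                chunks.append(current)
                current = ""
            chunks.extend(recursive_split(piece, chunk_size, finer))
            continue
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


demo = "alpha beta gamma\ndelta epsilon zeta eta theta\niota"
result = recursive_split(demo, 20)
hard_cut = recursive_split("a" * 45, 20)
print(result)
```

The second line of `demo` is longer than 20 characters, so only that line is re-split on spaces, while the other lines are left intact.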

Markdown Splitting

In [14]:
Markdown_Sample = """
### Sample Markdown Document
This is a sample markdown document to demonstrate the MarkdownHeaderTextSplitter.
# Header 1
This is some content under header 1.
"""

In [15]:
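# Note: this is the recursive splitter configured with Markdown-aware separators
# (headings are tried before plain newlines and spaces), not MarkdownHeaderTextSplitter.
md_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.MARKDOWN,
    chunk_size=100,
    chunk_overlap=0,
)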
md_result = md_splitter.split_text(Markdown_Sample)
for i, chunk in enumerate(md_result):
    print(f"\n--- Chunk {i+1} ---")
    print(chunk)


--- Chunk 1 ---
### Sample Markdown Document

--- Chunk 2 ---
This is a sample markdown document to demonstrate the MarkdownHeaderTextSplitter.

--- Chunk 3 ---
# Header 1
This is some content under header 1.
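
The sample text mentions MarkdownHeaderTextSplitter, which groups content by heading rather than by size. A small standard-library sketch of that header-based grouping is given here for comparison. `split_by_headers` is a hypothetical helper, not the library's implementation, and it ignores edge cases such as `#` characters inside fenced code blocks.

```python
import re

# A Markdown ATX heading: 1-6 '#' characters, whitespace, then the title.
HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headers(markdown: str) -> list[dict]:
    """Group lines under the most recent heading and record its level and title."""
    sections: list[dict] = []
    current = {"level": 0, "title": None, "body": []}
    for line in markdown.strip().splitlines():
        match = HEADER.match(line)
        if match:
            # A new heading closes the section collected so far.
            if current["title"] is not None or current["body"]:
                sections.append(current)
            current = {"level": len(match.group(1)),
                       "title": match.group(2).strip(),
                       "body": []}
        else:
            current["body"].append(line)
    if current["title"] is not None or current["body"]:
        sections.append(current)
    return [{"level": s["level"], "title": s["title"],
             "text": "\n".join(s["body"]).strip()} for s in sections]


markdown_sample = """
### Sample Markdown Document
This is a sample markdown document.
# Header 1
This is some content under header 1.
"""
sections = split_by_headers(markdown_sample)
for section in sections:
    print(section)
```

Each section keeps its heading as metadata, which is the main difference from the size-driven chunks printed above.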


Python Code Splitting

In [17]:
sample_code = """
def greet(name):
    print(f"Hello, {name}!")
greet("World")
"""
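# Language.PYTHON makes the splitter prefer class and function boundaries
# before falling back to newlines and spaces.
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=50,
    chunk_overlap=0,
)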
code_result = code_splitter.split_text(sample_code)
for i, chunk in enumerate(code_result):
    print(f"\n--- Code Chunk {i+1} ---")
    print(chunk)


--- Code Chunk 1 ---
def greet(name):
    print(f"Hello, {name}!")

--- Code Chunk 2 ---
greet("World")
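
For Python source specifically, a similar function-level split can be obtained from the syntax tree instead of from text separators. The sketch below uses the standard `ast` module. It is a different technique from what the splitter above does, the name `split_top_level` is made up, and it drops comments and blank lines that sit between top-level statements.

```python
import ast


def split_top_level(source: str) -> list[str]:
    """Return the source text of each top-level statement, in order."""
    tree = ast.parse(source)
    # get_source_segment recovers the exact text spanned by each node.
    return [ast.get_source_segment(source, node) for node in tree.body]


sample = '''
def greet(name):
    print(f"Hello, {name}!")
greet("World")
'''
segments = split_top_level(sample)
for i, segment in enumerate(segments):
    print(f"\n--- Segment {i+1} ---")
    print(segment)
```

Because the boundaries come from the parser, a function body is never cut in the middle, regardless of its length.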


Semantic Splitting

In [27]:
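import os

# Machine-specific cache location for downloaded Hugging Face models; adjust for your setup.
# These variables are best set before the Hugging Face libraries are first imported.
cache_dir = 'D:/Development/ML/Deep Learning/GenAI/.hf_cache'
os.environ['HF_HOME'] = cache_dir
os.environ['TRANSFORMERS_CACHE'] = cache_dir  # legacy variable; recent transformers releases prefer HF_HOME
os.environ['HF_DATASETS_CACHE'] = cache_dir
os.environ["SENTENCE_TRANSFORMERS_HOME"] = cache_dir
os.makedirs(cache_dir, exist_ok=True)

# Small sentence-embedding model used by the semantic chunker
SentenceModel = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2",
)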

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


In [38]:
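# Start a new chunk where the embedding distance between neighbouring sentences
# is unusually large (more than 1 standard deviation above the mean distance).
smn_chunker = SemanticChunker(
    embeddings=SentenceModel,
    breakpoint_threshold_type='standard_deviation',
    breakpoint_threshold_amount=1,
)
# create_documents expects raw strings, so pass the page contents
docs = [doc.page_content for doc in documents]
smn_res = smn_chunker.create_documents(docs)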
for i, doc in enumerate(smn_res):
    print(f"\n--- Semantic Chunk {i+1} ---")
    print(doc.page_content)


--- Semantic Chunk 1 ---
Hi this is sample text for LangChain DocLoader practice. It contains multiple lines. The HuggingFacePipeline in LangChain is a wrapper class that allows you to use models from the Hugging Face transformers library, run them locally, and integrate them into LangChain applications. Overview
Purpose: The class wraps the Hugging Face pipeline function, enabling seamless use of a wide range of open-source models (like GPT-2 or T5) within the LangChain framework. Local Execution: It runs models locally on your computer, making it suitable for privacy-sensitive applications or environments where an internet connection is unavailable after the initial download.

--- Semantic Chunk 2 ---
Supported Tasks: It primarily supports text-centric tasks such as text generation, text-to-text generation, and summarization. 
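
The idea behind the `standard_deviation` threshold can be sketched without any model: measure the distance between consecutive sentence embeddings and break wherever it exceeds the mean by the chosen number of standard deviations. The code below is a simplified illustration under that assumption. The sentences and two-dimensional "embeddings" are hand-made toy values, the function names are hypothetical, and the real chunker contains details that are omitted here.

```python
import math
from statistics import mean, pstdev


def cosine_distance(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def semantic_breakpoints(embeddings: list[list[float]], amount: float = 1.0) -> list[int]:
    """Indices i where the gap between sentence i and i+1 is unusually large."""
    if len(embeddings) < 2:
        return []
    distances = [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]
    threshold = mean(distances) + amount * pstdev(distances)
    return [i for i, d in enumerate(distances) if d > threshold]


def group_sentences(sentences: list[str], embeddings: list[list[float]],
                    amount: float = 1.0) -> list[str]:
    breaks = set(semantic_breakpoints(embeddings, amount))
    chunks, current = [], []
    for i, sentence in enumerate(sentences):
        current.append(sentence)
        if i in breaks:  # close the chunk after a large semantic jump
            chunks.append(" ".join(current))
            current = []
    if current:
        chunks.append(" ".join(current))
    return chunks


# Toy data: three sentences about cats, two about markets (vectors are invented).
sentences = ["Cats purr.", "Cats nap often.", "Kittens play.",
             "Stocks fell today.", "Markets closed lower."]
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05], [0.0, 1.0], [0.1, 0.9]]
breaks = semantic_breakpoints(toy_embeddings)
semantic_chunks = group_sentences(sentences, toy_embeddings)
print(breaks, semantic_chunks)
```

Only the jump from the cat sentences to the market sentences is large enough to cross the threshold, so the text is split into two chunks there. A lower `amount` would make the splitter more eager to break.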
