## How to split by character
This is the simplest way to split text. CharacterTextSplitter splits on a single separator string, which defaults to "\n\n", and measures chunk length in characters.
1. How the text is split: by a single separator string.
2. How the chunk size is measured: by number of characters.

To obtain the chunks directly as strings, use .split_text().


CharacterTextSplitter can create chunks larger than the specified chunk_size (20 characters in the example below). This happens when the text between two occurrences of the separator is longer than chunk_size: the splitter only cuts at the separator, so it keeps the oversized piece whole rather than cutting the text arbitrarily.
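
The oversized-chunk behavior can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: it ignores chunk_overlap, and the function name split_by_separator and the sample text are hypothetical.

```python
def split_by_separator(text: str, separator: str, chunk_size: int) -> list[str]:
    """Split on one separator, then greedily merge pieces up to chunk_size characters."""
    pieces = [p for p in text.split(separator) if p]
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        # Try to append the next piece to the chunk being built.
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A piece longer than chunk_size is kept whole; it is never cut mid-piece.
            current = piece
    if current:
        chunks.append(current)
    return chunks


sample = "short\n\nthis paragraph is much longer than twenty characters\n\nend"
for chunk in split_by_separator(sample, "\n\n", chunk_size=20):
    print(len(chunk), repr(chunk))
```

The middle paragraph contains no "\n\n", so it comes out as a single chunk well over 20 characters. Choosing a separator that occurs more often in the text, or a larger chunk_size, avoids this.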

In [4]:
# Load the document
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import CharacterTextSplitter

loader = PyPDFLoader('Cover.pdf')
documents = loader.load()

# Extract the text of the first page (assumes a single-page PDF)
document_text = documents[0].page_content

# Split the text into chunks, using the string 'and' as the separator
splitter = CharacterTextSplitter(
    chunk_size=20,
    chunk_overlap=4,
    separator='and'
)

chunked_text = splitter.split_text(document_text)
print(chunked_text[0])


ValueError: File path Cover.pdf is not a valid file or url

## How to recursively split text by characters
This text splitter is the recommended one for generic text. It is parameterized by a list of separators and tries them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. This has the effect of keeping paragraphs (and then sentences, and then words) together as long as possible, since those tend to be the most semantically related pieces of text.
1. How the text is split: by a list of separators.
2. How the chunk size is measured: by number of characters.

To obtain the chunks directly as strings, use .split_text().
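
The "try each separator in order" idea can be sketched in plain Python. This is a simplified illustration, not LangChain's implementation: it ignores chunk_overlap and drops the separator wherever it cuts, and the helper names merge_pieces and recursive_split are hypothetical.

```python
def merge_pieces(pieces: list[str], separator: str, chunk_size: int) -> list[str]:
    """Greedily join small pieces with the separator, staying within chunk_size."""
    chunks: list[str] = []
    current = ""
    for piece in pieces:
        if not piece:
            continue  # skip empty pieces produced by repeated separators
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks


def recursive_split(text: str, separators: list[str], chunk_size: int) -> list[str]:
    """Split on the first separator; recurse with the remaining ones on oversized parts."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        return [text]  # nothing left to split on
    separator, *rest = separators
    if separator == "":
        # Last resort: cut into fixed-size character windows.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks: list[str] = []
    small: list[str] = []
    for part in text.split(separator):
        if len(part) <= chunk_size:
            small.append(part)
        else:
            # Flush the small parts collected so far, then split the big part further.
            chunks.extend(merge_pieces(small, separator, chunk_size))
            small = []
            chunks.extend(recursive_split(part, rest, chunk_size))
    chunks.extend(merge_pieces(small, separator, chunk_size))
    return chunks


text = ("First paragraph here.\n\n"
        "Second paragraph is quite a bit longer than the first one.\n\n"
        "End.")
for chunk in recursive_split(text, ["\n\n", "\n", " ", ""], chunk_size=30):
    print(repr(chunk))
```

Short paragraphs survive intact, and only the paragraph that exceeds chunk_size is broken down further, at word boundaries.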



In [8]:
pip install langchain langchain-community

Note: you may need to restart the kernel to use updated packages.


In [11]:
# Load the document
from langchain_community.document_loaders import TextLoader

myfile = TextLoader('text.txt')
documents = myfile.load()  # load() returns a list of Document objects, so it cannot be called
documents

In [47]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF document
myfile = PyPDFLoader('samplepdf.pdf')
documents = myfile.load()

# Display the contents of the documents
for doc in documents:
    print(doc)


page_content=' 
University Institute of Information Technology,  
PMAS -Arid Agriculture University,  
Rawalpindi Pakistan  
 
PhysioFlex -personal Posture Trainer  
 
 
By 
 
Hamza Akram         20-ARID -763 
M Fahad Bashir       20 -ARID -790 
Syeda Moniza           20 -ARID -833 
 
 
Supervisor  
Dr. Ruqia Bibi  
Bachelor of Science in Software Engineering  (2020-2024) 
 ' metadata={'source': 'samplepdf.pdf', 'page': 0}
page_content='II 
  
 
The candidate confirms that the work submitted is their own and appropriate  
 credit has been given where reference has been made to the work of others . 
 
DECLARATION  
 
We hereby declare that this software, neither whole nor as a part has been copied out from any 
source. It is further declared that we have developed this software documentation and 
accompanied report entirely on the basis of our personal efforts. If any part of this project is 
proved to be copied out from any source or found to be reproduction of some other. We will 
sta

In [54]:
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load the PDF document
myfile = PyPDFLoader('samplepdf.pdf')
documents = myfile.load()
documents



[Document(metadata={'source': 'samplepdf.pdf', 'page': 0}, page_content=' \nUniversity Institute of Information Technology,  \nPMAS -Arid Agriculture University,  \nRawalpindi Pakistan  \n \nPhysioFlex -personal Posture Trainer  \n \n \nBy \n \nHamza Akram         20-ARID -763 \nM Fahad Bashir       20 -ARID -790 \nSyeda Moniza           20 -ARID -833 \n \n \nSupervisor  \nDr. Ruqia Bibi  \nBachelor of Science in Software Engineering  (2020-2024) \n '),
 Document(metadata={'source': 'samplepdf.pdf', 'page': 1}, page_content='II \n  \n \nThe candidate confirms that the work submitted is their own and appropriate  \n credit has been given where reference has been made to the work of others . \n \nDECLARATION  \n \nWe hereby declare that this software, neither whole nor as a part has been copied out from any \nsource. It is further declared that we have developed this software documentation and \naccompanied report entirely on the basis of our personal efforts. If any part of this project

In [55]:


# Combine all the extracted text into a single string
full_text = "\n".join([doc.page_content for doc in documents])

# Create a RecursiveCharacterTextSplitter instance
chunker = RecursiveCharacterTextSplitter(
    chunk_size=200,
    separators=["\n\n", ".", "!", "?", ",", " ", ""],
    chunk_overlap=20
)

# Split the full text into chunks
chunks = chunker.split_text(full_text)

# Print out the chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:")
    print(chunk)
    print("-" * 40)

Chunk 1:
University Institute of Information Technology,  
PMAS -Arid Agriculture University
----------------------------------------
Chunk 2:
,  
Rawalpindi Pakistan  
 
PhysioFlex -personal Posture Trainer  
 
 
By 
 
Hamza Akram         20-ARID -763 
M Fahad Bashir       20 -ARID -790 
Syeda Moniza           20 -ARID -833 
 
 
Supervisor
----------------------------------------
Chunk 3:
Supervisor  
Dr
----------------------------------------
Chunk 4:
. Ruqia Bibi  
Bachelor of Science in Software Engineering  (2020-2024) 
 
II 
  
 
The candidate confirms that the work submitted is their own and appropriate  
 credit has been given where reference
----------------------------------------
Chunk 5:
where reference has been made to the work of others
----------------------------------------
Chunk 6:
. 
 
DECLARATION  
 
We hereby declare that this software, neither whole nor as a part has been copied out from any 
source
----------------------------------------
Chunk 7:
. It is furthe

### Recursively Splitting Text by Characters
Recursive character splitting breaks a large piece of text into smaller, manageable chunks using an ordered set of separators (periods, commas, and so on). This matters because many models have a maximum token limit, so each chunk must stay within that limit.
Here are the steps:
1. Import the Required Splitter: LangChain provides a RecursiveCharacterTextSplitter class for this purpose.

2. Create an Instance of the Splitter: You can specify the set of characters you want to split by (e.g., periods, commas) and the maximum chunk size.

3. Split the Text: Use the splitter to split the loaded text into chunks.
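
After splitting, it can help to confirm that every chunk actually fits the limit. A minimal sketch, assuming the chunks are plain strings such as those returned by split_text(); the helper name oversized_chunks and the sample chunks are hypothetical.

```python
def oversized_chunks(chunks: list[str], chunk_size: int) -> list[tuple[int, int]]:
    """Return (index, length) for every chunk longer than chunk_size characters."""
    return [(i, len(c)) for i, c in enumerate(chunks) if len(c) > chunk_size]


chunks = ["Once you've loaded", "documents, you'll", "a chunk that is clearly too long"]
print(oversized_chunks(chunks, chunk_size=20))  # prints [(2, 32)]
```

This counts characters, as the splitter does by default, not model tokens, so a chunk that passes this check can still exceed a token-based limit.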

In [12]:
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=20,
    chunk_overlap=4,
    separators=["\n\n", ".", "!", "?", ",", " ", ""]
)

# Load the document; load() returns a list of Document objects
documents = TextLoader('text.txt').load()

# Split the documents into chunks
split_docs = splitter.split_documents(documents)

# Print the number of chunks, then each chunk
print(f"Number of chunks: {len(split_docs)}")
for doc in split_docs:
    print(doc)

## How to split code

In [25]:

from langchain_text_splitters import RecursiveCharacterTextSplitter

#loading python file
with open('app.py' , 'r') as file:
    code=file.read()
splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,   # Adjust based on the size of your code chunks
    chunk_overlap=10,
    separators=["\n\n", "\n", ";", "}", "{", " "]
)

# Split the code into chunks
chunked_code = splitter.split_text(code)

# Output the number of chunks and a sample chunk
print(f"Number of chunks: {len(chunked_code)}")
for i in chunked_code:
    print("Chunk : ", i)


Number of chunks: 6
Chunk :  def example_function():
    for i in range(10):
Chunk :  print(i)
    return 'Done'
Chunk :  class ExampleClass:
    def __init__(self):
Chunk :  self.value = 0
Chunk :  def increment(self):
        self.value += 1
Chunk :  return self.value


In [31]:
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

# Define your Python code
PYTHON_CODE = """
def hello_world():
    print("Hello, World!")

    def add(a, b):
    return a + b
hello_world()
result = add(3, 5)
print(f"Result: {result}")
"""

# Create a splitter that uses Python-specific separators
python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=50,
    chunk_overlap=10  # overlap between chunks to maintain context
)

# Split the code into chunks
chunks = python_splitter.split_text(PYTHON_CODE)
chunks


['def hello_world():\n    print("Hello, World!")',
 'def add(a, b):\n    return a + b',
 'hello_world()\nresult = add(3, 5)',
 'print(f"Result: {result}")']

## Splitter examples #1


In [2]:
from langchain.text_splitter import RecursiveCharacterTextSplitter, CharacterTextSplitter

In [15]:
text="Once you've loaded documents, you'll often want to transform them to better suit your application. The simplest example is you may want to split a long document into smaller chunks that can fit into your model's context window"
r_splitter=RecursiveCharacterTextSplitter(
    chunk_size=20,
    separators=['\n','.',' ',],
    chunk_overlap=4
)
r_splitter.split_text(text)

["Once you've loaded",
 "documents, you'll",
 'often want to',
 'to transform them',
 'to better suit your',
 'application',
 '. The simplest',
 'example is you may',
 'may want to split a',
 'a long document',
 'into smaller chunks',
 'that can fit into',
 "your model's",
 'context window']

In [44]:
from langchain.text_splitter import CharacterTextSplitter

# Define the text to split
text = "This is an example. of how character .splitter works. This is the start.  "

# Create a CharacterTextSplitter instance
char_splitter = CharacterTextSplitter(
    chunk_size=25,
    separator='.',  # Split on periods
    chunk_overlap=3
)

# Split the text into chunks
chunks = char_splitter.split_text(text)

# Display the resulting chunks
chunks


['This is an example',
 'of how character',
 'splitter works',
 'This is the start.']