## Text Splitting from Documents - RecursiveCharacterTextSplitter

This text splitter is the recommended one for generic text. It is parameterized by a list of separator strings and tries them in order until the chunks are small enough. The default list is ["\n\n", "\n", " ", ""]. The effect is to keep paragraphs together for as long as possible, then lines (which often correspond to sentences), then words, since those tend to be the most strongly semantically related pieces of text.

- How the text is split: by a list of characters.
- How the chunk size is measured: by number of characters.
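
The splitting strategy described above can be sketched in plain Python. This is a simplified illustration, not LangChain's actual implementation: the function name `recursive_split` is hypothetical, and the sketch ignores chunk overlap entirely. It only shows the core idea of trying coarse separators first and falling back to finer ones for pieces that are still too long.

```python
from typing import List

# Same default order as described above: paragraphs, lines, words, characters.
DEFAULT_SEPARATORS = ["\n\n", "\n", " ", ""]


def recursive_split(text: str, chunk_size: int, separators: List[str]) -> List[str]:
    """Hypothetical helper: split `text` into chunks of at most `chunk_size` characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character positions.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        if not piece:
            continue
        # Try to merge the piece into the chunk being built.
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


sample = (
    "First paragraph is short.\n\n"
    "Second paragraph is a bit longer and has\ntwo lines in it.\n\n"
    "Third."
)
for chunk in recursive_split(sample, 40, DEFAULT_SEPARATORS):
    print(len(chunk), repr(chunk))
```

In this sample only the second paragraph exceeds 40 characters, so it is the only one split further, on its line break. The other paragraphs stay intact.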


## Reading a PDF File


In [2]:
from langchain_community.document_loaders import PyPDFLoader

# Load the PDF; each page becomes one Document with its own metadata
loader = PyPDFLoader('data/2024_berkshire_hathway_shareholder_letter.pdf')
docs = loader.load()
docs

[Document(metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PyPDF', 'creationdate': '2025-02-22T07:14:18-06:00', 'moddate': '2025-02-22T07:14:40-06:00', 'title': 'printmgr file', 'source': 'data/2024_berkshire_hathway_shareholder_letter.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='BERKSHIRE HATHAWAY INC. \nTo the Shareholders of Berkshire Hathaway Inc.: \nThis letter comes to you as part of Berkshire’s annual report. As a public company, we \nare required to periodically tell you many specific facts and figures. \n“Report,” however, implies a greater responsibility. In addition to the mandated data, we \nbelieve we owe you additional commentary about what you own and how we think. Our goal is \nto communicate with you in a manner that we would wish you to use if our positions were \nreversed – that is, if you were Berkshire’s CEO while I and my family were passive investors, \ntrusting you with our savings. \nThis approach leads us to an an

In [3]:
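from langchain_text_splitters import RecursiveCharacterTextSplitter

# Chunks of at most 600 characters, with up to 50 characters shared between neighbours
text_splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=50)

# split_documents takes Document objects and carries each page's metadata over to its chunks
final_documents = text_splitter.split_documents(docs)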

In [4]:
final_documents

[Document(metadata={'producer': 'Acrobat Distiller 8.1.0 (Windows)', 'creator': 'PyPDF', 'creationdate': '2025-02-22T07:14:18-06:00', 'moddate': '2025-02-22T07:14:40-06:00', 'title': 'printmgr file', 'source': 'data/2024_berkshire_hathway_shareholder_letter.pdf', 'total_pages': 15, 'page': 0, 'page_label': '1'}, page_content='BERKSHIRE HATHAWAY INC. \nTo the Shareholders of Berkshire Hathaway Inc.: \nThis letter comes to you as part of Berkshire’s annual report. As a public company, we \nare required to periodically tell you many specific facts and figures. \n“Report,” however, implies a greater responsibility. In addition to the mandated data, we \nbelieve we owe you additional commentary about what you own and how we think. Our goal is \nto communicate with you in a manner that we would wish you to use if our positions were \nreversed – that is, if you were Berkshire’s CEO while I and my family were passive investors,'),
 Document(metadata={'producer': 'Acrobat Distiller 8.1.0 (Windo

## Text Loader


In [5]:

from langchain_community.document_loaders import TextLoader

loader=TextLoader("data/loreal_shareholder_2022.txt")

docs=loader.load()
docs


[Document(metadata={'source': 'data/loreal_shareholder_2022.txt'}, page_content='“Dear Shareholders,\nL’Oréal continues on the path to success with an ever-stronger ambition, while acting with the sense of responsibility of a global leader. Dual financial and social excellence will always be at the heart of our business model.\nWe have set ourselves the ultimate goal of creating value that benefits everyone.\nWe create value for you, our shareholders.\xa0The resilience and outperformance of your Company are the perfect demonstration of its robust, virtuous and value creating business model. The quality of our results puts us in a position to offer a dividend of €6 per share, representing a significant increase of +25%. And the preferential dividend with a 10% loyalty bonus(1), at €6.60, is recognition of your long-term loyalty.\nI also know that you attach just as much importance to the quality of our relationship with you, our shareholders. I am delighted to welcome the more than 30,0

In [6]:
# Read the raw text; the file contains non-ASCII characters, so state the encoding explicitly
with open("data/loreal_shareholder_2022.txt", encoding="utf-8") as f:
    speech_text = f.read()
speech_text

'“Dear Shareholders,\nL’Oréal continues on the path to success with an ever-stronger ambition, while acting with the sense of responsibility of a global leader. Dual financial and social excellence will always be at the heart of our business model.\nWe have set ourselves the ultimate goal of creating value that benefits everyone.\nWe create value for you, our shareholders.\xa0The resilience and outperformance of your Company are the perfect demonstration of its robust, virtuous and value creating business model. The quality of our results puts us in a position to offer a dividend of €6 per share, representing a significant increase of +25%. And the preferential dividend with a 10% loyalty bonus(1), at €6.60, is recognition of your long-term loyalty.\nI also know that you attach just as much importance to the quality of our relationship with you, our shareholders. I am delighted to welcome the more than 30,000 new individual shareholders in France who joined us in 2022.\nWe create value

In [7]:
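# Small chunks (100 characters, 20 overlap) make the overlap easy to see in the output
text_splitter = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20)
# create_documents takes a list of raw strings and returns a list of Document objects
text = text_splitter.create_documents([speech_text])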

In [8]:
text

[Document(metadata={}, page_content='“Dear Shareholders,'),
 Document(metadata={}, page_content='L’Oréal continues on the path to success with an ever-stronger ambition, while acting with the'),
 Document(metadata={}, page_content='acting with the sense of responsibility of a global leader. Dual financial and social excellence'),
 Document(metadata={}, page_content='social excellence will always be at the heart of our business model.'),
 Document(metadata={}, page_content='We have set ourselves the ultimate goal of creating value that benefits everyone.'),
 Document(metadata={}, page_content='We create value for you, our shareholders.\xa0The resilience and outperformance of your Company are'),
 Document(metadata={}, page_content='of your Company are the perfect demonstration of its robust, virtuous and value creating business'),
 Document(metadata={}, page_content='creating business model. The quality of our results puts us in a position to offer a dividend of €6'),
 Document(metadata=

In [9]:
print(text[0])

page_content='“Dear Shareholders,'


In [10]:
print(text[1])

page_content='L’Oréal continues on the path to success with an ever-stronger ambition, while acting with the'


In [11]:
print(text[2])

page_content='acting with the sense of responsibility of a global leader. Dual financial and social excellence'
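
The effect of chunk_overlap=20 is visible in the three chunks printed above: the end of one chunk is repeated at the start of the next. A small check in plain Python makes this concrete. The helper `shared_overlap` is a hypothetical name introduced only for this illustration, and the chunk strings are copied from the output above.

```python
def shared_overlap(left: str, right: str) -> str:
    """Return the longest suffix of `left` that is also a prefix of `right`."""
    for size in range(min(len(left), len(right)), 0, -1):
        if left.endswith(right[:size]):
            return right[:size]
    return ""


# Chunk texts copied from text[1], text[2] and text[3] above.
chunk_1 = "L’Oréal continues on the path to success with an ever-stronger ambition, while acting with the"
chunk_2 = "acting with the sense of responsibility of a global leader. Dual financial and social excellence"
chunk_3 = "social excellence will always be at the heart of our business model."

for left, right in [(chunk_1, chunk_2), (chunk_2, chunk_3)]:
    overlap = shared_overlap(left, right)
    print(len(overlap), repr(overlap))
```

Both overlaps are whole words and shorter than 20 characters. This suggests that chunk_overlap acts as an upper bound and that the splitter does not cut words just to reach it.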


CharacterTextSplitter is like cutting a cake into slices using a knife: it splits text on a single separator (such as a newline).

RecursiveCharacterTextSplitter is more like dissecting a complex machine: it works through a list of separators (by default paragraph breaks, then line breaks, then spaces, then individual characters) and recursively re-splits any piece that is still too large. This keeps chunks closer to the desired size, even in complex documents, making it suitable for varied text structures.
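
To see why a single separator can fall short, here is a minimal plain-Python sketch. It is not the CharacterTextSplitter source; the function name `split_on_single_separator` and the sample text are assumptions made for illustration. With only one separator, a piece that contains no separator cannot be reduced further, so it stays longer than the requested size.

```python
from typing import List


def split_on_single_separator(text: str, separator: str, chunk_size: int) -> List[str]:
    """Hypothetical helper: split on one separator and merge pieces up to `chunk_size`."""
    chunks: List[str] = []
    current = ""
    for piece in text.split(separator):
        if not piece:
            continue
        candidate = piece if not current else current + separator + piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # No finer separator to fall back on: the piece is kept whole,
            # even if it is longer than chunk_size.
            current = piece
    if current:
        chunks.append(current)
    return chunks


long_paragraph = " ".join(["word"] * 30)  # one paragraph with no blank line inside
sample_text = "Short intro.\n\n" + long_paragraph

single_chunks = split_on_single_separator(sample_text, "\n\n", chunk_size=50)
for chunk in single_chunks:
    print(len(chunk), "over limit" if len(chunk) > 50 else "ok")
```

A recursive splitter would go on to split the oversized paragraph on spaces, which is why it tends to respect the size limit on text with irregular structure.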