In [1]:
!pip install -q langchain langchain-openai langchain-community
!pip install -q langchain-text-splitters langchain-postgres

In [2]:
!pip install -q "unstructured[pdf]"

In [3]:
import os
from google.colab import userdata

OPENAI_API_KEY = userdata.get('OPENAI_API_KEY')
os.environ['OPENAI_API_KEY'] = OPENAI_API_KEY

In [4]:
from langchain_community.document_loaders import TextLoader

loader = TextLoader('/content/PlainText.txt')
docs = loader.load()

In [5]:
print(docs[0].page_content)
print(docs[0].metadata)

A TXT file is a type of file that stores plain text without any special formatting, styling, or complex media. It is a simple, universal format that can be opened on almost any device using a basic text editor like Notepad on Windows or TextEdit on Mac. These files are used for storing simple text documents, source code, and configuration data.  

Key characteristics
	•	Plain text: Contains only characters, not complex formatting like bold, italics, or different fonts.    
	•	Simple: Easy to create and edit on a wide variety of operating systems and devices.    
	•	Universal: Highly compatible across different software and hardware.    
	•	Small size: The lack of formatting results in smaller file sizes compared to rich text documents.    
	•	Versatile: Can store anything from a simple note to complex data like computer code or scripts.    
How to create and use a TXT file
	•	Create: Use a simple text editor application on your computer. For example, on a Mac, you can set TextEdit to d

In [6]:
# metadata is important later on for retrieval

loader = TextLoader('/content/PlainText.txt')
docs = loader.load()

for d in docs:
  d.metadata["owner"] = "Hemant"
  d.metadata["type"] = "notes"

print(docs[0].metadata)

{'source': '/content/PlainText.txt', 'owner': 'Hemant', 'type': 'notes'}


In [7]:
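# load multiple files
# Note: TextLoader takes a single file path and does not expand glob patterns,
# so the attempt below fails. Use DirectoryLoader with glob= instead.

# from langchain_community.document_loaders import TextLoader

# loader = TextLoader('/content/*.txt')
# docs = loader.load()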

In [8]:
# load PDF

# from langchain_community.document_loaders import PyPDFLoader

# loader = PyPDFLoader('/content/notes.pdf')
# docs = loader.load()

### Break the code intentionally and check the errors

In [9]:
# load a file that's not available

TextLoader('/content/does_not_exist.txt').load()

RuntimeError: Error loading /content/does_not_exist.txt

In [10]:
# load entire directory with TextLoader

TextLoader('/content').load()

RuntimeError: Error loading /content
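
Both failures above surface as the same generic RuntimeError, so it can help to check the path before handing it to a single-file loader. The following is a minimal sketch in plain Python; `check_loadable` is a hypothetical helper, not part of LangChain, and it only covers the two cases tried above (missing path, directory instead of file).

```python
import tempfile
from pathlib import Path


def check_loadable(path_str):
    """Return why a single-file loader would fail on this path, or None if it looks fine."""
    path = Path(path_str)
    if not path.exists():
        return "missing"
    if path.is_dir():
        return "directory"
    return None


# Demonstrate on a throwaway directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    real_file = Path(tmp) / "notes.txt"
    real_file.write_text("hello", encoding="utf-8")

    results = {
        "file": check_loadable(real_file),
        "directory": check_loadable(tmp),
        "missing": check_loadable(Path(tmp) / "does_not_exist.txt"),
    }

print(results)
```

A "directory" result is the cue to switch to DirectoryLoader, as the next cell does.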

In [11]:
from langchain_community.document_loaders import DirectoryLoader

loader = DirectoryLoader(
    "/content",
    glob="**/*.txt",      # only .txt files
    show_progress=True
)

docs = loader.load()

100%|██████████| 1/1 [00:02<00:00,  2.10s/it]


In [12]:
docs

[Document(metadata={'source': '/content/PlainText.txt'}, page_content='A TXT file is\xa0a type of file that stores plain text without any special formatting, styling, or complex media.\xa0It is a simple, universal format that can be opened on almost any device using a basic text editor like Notepad on Windows or TextEdit on Mac.\xa0These files are used for storing simple text documents, source code, and configuration data.\n\nKey characteristics •\tPlain text:\xa0Contains only characters, not complex formatting like bold, italics, or different fonts. •\tSimple:\xa0Easy to create and edit on a wide variety of operating systems and devices. •\tUniversal:\xa0Highly compatible across different software and hardware. •\tSmall size:\xa0The lack of formatting results in smaller file sizes compared to rich text documents. •\tVersatile:\xa0Can store anything from a simple note to complex data like computer code or scripts. How to create and use a TXT file •\tCreate:\xa0Use a simple text editor 

In [13]:
loader_all = DirectoryLoader(
    path='/content',
    glob="*",
    show_progress=True
)

docs_all = loader_all.load()

100%|██████████| 2/2 [00:48<00:00, 24.46s/it]


### Questions

Why does LangChain wrap files into Document objects instead of returning raw strings?

What is the purpose of metadata in the RAG ecosystem?

How does the Document abstraction help with chunking and vector stores?

### Answers

You need page_content for embeddings

You need metadata for retrieval filtering

You may split documents into chunks later

Vector stores and retrievers expect a uniform format

Keeps file-specific logic inside loaders, not polluting your RAG code
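
The points above can be made concrete with a stand-in for the Document class. This is a minimal sketch in plain Python, not LangChain's implementation: `SimpleDocument` and `filter_by_metadata` are hypothetical names, and the sample texts are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocument:
    """Hypothetical stand-in for a LangChain Document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def filter_by_metadata(docs, **criteria):
    """Keep only documents whose metadata matches every key/value in criteria."""
    return [
        d for d in docs
        if all(d.metadata.get(k) == v for k, v in criteria.items())
    ]


docs = [
    SimpleDocument("A TXT file stores plain text.", {"source": "PlainText.txt", "type": "notes"}),
    SimpleDocument("Annual report text.", {"source": "Tesla10K.pdf", "type": "filing"}),
]

# Downstream code only touches page_content and metadata,
# no matter which loader produced the document.
notes = filter_by_metadata(docs, type="notes")
print([d.metadata["source"] for d in notes])
```

Because every loader returns the same shape, the filtering code never needs to know whether the text came from a TXT file, a PDF, or a web page.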



In [14]:
from langchain_community.document_loaders import WebBaseLoader

loader = WebBaseLoader("https://www.langchain.com/")
docs = loader.load()



In [15]:
docs[0].metadata

{'source': 'https://www.langchain.com/',
 'title': 'LangChain',
 'description': 'LangChain provides the engineering platform and open source frameworks developers use to build, test, and deploy reliable AI agents.',
 'language': 'en'}

In [16]:
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader('/content/Tesla10K.pdf')
pages = loader.load()

In [17]:
pages

[Document(metadata={'producer': 'Skia/PDF m141', 'creator': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/141.0.0.0 Safari/537.36', 'creationdate': '2025-12-03T10:01:47+00:00', 'title': '10-K', 'moddate': '2025-12-03T10:01:47+00:00', 'source': '/content/Tesla10K.pdf', 'total_pages': 115, 'page': 0, 'page_label': '1'}, page_content='id           \nUNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C. 20549\nFORM 10-K\n(Mark One)\n☒ ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the fiscal year ended December 31, 2022\nOR\n☐ TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d) OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the transition period from _________ to _________\nCommission File Number: 001-34756\nTesla, Inc.\n(Exact name of registrant as specified in its charter)\n  \nDelaware  91-2197729\n(State or other jurisdiction of\nincorporation or organization)\n (I.R.S. Employer\nIdenti

The text has been extracted from the PDF document and stored in the Document
class. But there's a problem. The loaded document is over 100,000 characters long,
so it won't fit into the context window of the vast majority of LLMs or embedding
models. To overcome this limitation, we need to split the Document into
manageable chunks of text that we can later convert into embeddings and search
semantically.
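
Before reaching for a library splitter, the basic idea can be sketched as a fixed-size sliding window. This is an illustration only: `sliding_window_chunks` is a hypothetical helper, and it cuts purely by character count with no regard for word or sentence boundaries.

```python
def sliding_window_chunks(text, chunk_size, chunk_overlap):
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_window_chunks(sample, chunk_size=10, chunk_overlap=3)
print(chunks)
# → ['abcdefghij', 'hijklmnopq', 'opqrstuvwx', 'vwxyz']
```

Each chunk repeats the last three characters of the previous one, which is the role chunk_overlap plays in the splitter used in the next cell.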

In [18]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

docs = TextLoader('/content/PlainText.txt').load()

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20
)

splitted_docs = splitter.split_documents(docs)

In [19]:
splitted_docs

[Document(metadata={'source': '/content/PlainText.txt'}, page_content='A TXT file is\xa0a type of file that stores plain text without any special formatting, styling, or'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='styling, or complex media.\xa0It is a simple, universal format that can be opened on almost any device'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='almost any device using a basic text editor like Notepad on Windows or TextEdit on Mac.\xa0These files'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='on Mac.\xa0These files are used for storing simple text documents, source code, and configuration'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='and configuration data.'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='Key characteristics'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='•\tPlain text:\xa0Contains only charac

In [20]:
len(splitted_docs)

23

In [21]:
for doc in splitted_docs:
  print(doc.page_content, '\t', len(doc.page_content))

A TXT file is a type of file that stores plain text without any special formatting, styling, or 	 95
styling, or complex media. It is a simple, universal format that can be opened on almost any device 	 99
almost any device using a basic text editor like Notepad on Windows or TextEdit on Mac. These files 	 99
on Mac. These files are used for storing simple text documents, source code, and configuration 	 94
and configuration data. 	 23
Key characteristics 	 19
•	Plain text: Contains only characters, not complex formatting like bold, italics, or different 	 95
or different fonts. 	 19
•	Simple: Easy to create and edit on a wide variety of operating systems and devices. 	 85
•	Universal: Highly compatible across different software and hardware. 	 70
•	Small size: The lack of formatting results in smaller file sizes compared to rich text 	 88
to rich text documents. 	 23
•	Versatile: Can store anything from a simple note to complex data like computer code or 	 88
computer code or scripts.

In [22]:
# change the chunk_size to see its effect

splitter = RecursiveCharacterTextSplitter(chunk_size=300, chunk_overlap=20)
splitted_docs = splitter.split_documents(docs)
splitted_docs

[Document(metadata={'source': '/content/PlainText.txt'}, page_content='A TXT file is\xa0a type of file that stores plain text without any special formatting, styling, or complex media.\xa0It is a simple, universal format that can be opened on almost any device using a basic text editor like Notepad on Windows or TextEdit on Mac.\xa0These files are used for storing simple text'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='storing simple text documents, source code, and configuration data.'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='Key characteristics\n\t•\tPlain text:\xa0Contains only characters, not complex formatting like bold, italics, or different fonts.\xa0\u2028\u2028\u2028\n\t•\tSimple:\xa0Easy to create and edit on a wide variety of operating systems and devices.\xa0\u2028\u2028\u2028\n\t•\tUniversal:\xa0Highly compatible across different software and hardware.'),
 Document(metadata={'source': '/content/PlainText.txt'}, 

In [23]:
# Trying hierarchical splitting: break on paragraphs → sentences → spaces → characters.

splitter = RecursiveCharacterTextSplitter(
    chunk_size=100,
    chunk_overlap=20,
    separators=["\n\n","\n",". "," ",""]
)

splitted_docs = splitter.split_documents(docs)
splitted_docs

[Document(metadata={'source': '/content/PlainText.txt'}, page_content='A TXT file is\xa0a type of file that stores plain text without any special formatting, styling, or'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='styling, or complex media.\xa0It is a simple, universal format that can be opened on almost any device'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='almost any device using a basic text editor like Notepad on Windows or TextEdit on Mac.\xa0These files'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='on Mac.\xa0These files are used for storing simple text documents, source code, and configuration'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='and configuration data.'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='Key characteristics'),
 Document(metadata={'source': '/content/PlainText.txt'}, page_content='•\tPlain text:\xa0Contains only charac

In [24]:
for doc in splitted_docs:
  print(doc.metadata)

{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}
{'source': '/content/PlainText.txt'}


In [25]:
splitted_docs[0].metadata["owner"] = "Hemant"

In [26]:
splitted_docs[0].metadata

{'source': '/content/PlainText.txt', 'owner': 'Hemant'}

### Why use RecursiveCharacterTextSplitter:

Preserves semantic boundaries (paragraph → sentence → space → chars)

Produces predictable chunk sizes

Works with any text (robust)

Preserves metadata (important for retrieval)

Supports hierarchical fallback splitting

Chunking is done BEFORE embedding so the embedding model sees coherent text.

The overlap avoids losing meaning when text spans between chunks.
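
The hierarchical fallback can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not LangChain's actual algorithm: `recursive_split` is a hypothetical function, it ignores overlap, and it drops the separator at each cut point (the real splitter can keep it).

```python
def recursive_split(text, chunk_size, separators):
    """Split text into pieces of at most chunk_size characters, trying the
    coarsest separator first and falling back to finer ones."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: hard cut by character count.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p]

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too large: retry with the next, finer separator.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


text = "First paragraph. It has two sentences.\n\nSecond paragraph is short."
result = recursive_split(text, chunk_size=30, separators=["\n\n", "\n", ". ", " "])
print(result)
# → ['First paragraph', 'It has two sentences.', 'Second paragraph is short.']
```

The first paragraph is too long for one chunk, so it falls through to the sentence separator, while the second paragraph fits and is kept whole.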

### Questions:

Conceptual

Q: “Why does recursive splitting use a hierarchy of separators?”

A: Because it tries to preserve meaningful boundaries (paragraphs → sentences → words → characters) while still enforcing a strict maximum chunk size.

Q: “How does chunk size affect embedding quality?”

A: If chunks are too small, each one lacks context and retrieval becomes noisy. If chunks are too big, the embedding represents an average of several ideas and is harder to match against a specific query. A common starting point is roughly 300-500 tokens per chunk, tuned for your data and model.

Q: “When should I use token-based splitting instead of character-based?”

A:
Need to respect embedding model token limits: Yes

Non-English languages: Yes

Code/logs/structured data: Yes

Cost-sensitive pipelines: Yes

Small/simple English text: Optional

Quick prototype: Optional

Memory-constrained system: Token-based is safer

Q: “Why does metadata matter during splitting?”

A: Metadata matters because after splitting, each chunk becomes its own Document, and the metadata is the only way to track its origin (which file, page, paragraph it came from), filter during retrieval, reconstruct context, debug retrieval.
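
A small sketch of how metadata can travel with each chunk. This is plain Python for illustration, not LangChain code: `split_with_metadata` is a hypothetical helper that cuts by fixed character count, copies the parent metadata into every chunk, and records a `start_index` offset (LangChain splitters offer a comparable `add_start_index` option).

```python
import copy


def split_with_metadata(text, metadata, chunk_size):
    """Cut text into fixed-size pieces; each piece gets its own copy of the
    parent metadata plus its character offset in the original text."""
    chunks = []
    for start in range(0, len(text), chunk_size):
        chunk_meta = copy.deepcopy(metadata)  # independent copy per chunk
        chunk_meta["start_index"] = start     # where the chunk began
        chunks.append({
            "page_content": text[start:start + chunk_size],
            "metadata": chunk_meta,
        })
    return chunks


parent_meta = {"source": "/content/PlainText.txt", "owner": "Hemant"}
chunks = split_with_metadata("Plain text files are simple.", parent_meta, chunk_size=10)

# Tagging one chunk does not leak into its siblings or the parent.
chunks[0]["metadata"]["reviewed"] = True
print(chunks[1]["metadata"])
```

Because each chunk holds its own copy, per-chunk tags such as the `owner` assignment in cell 25 stay local to that chunk.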

Q: “What are alternatives to RecursiveCharacterTextSplitter?”

A:
1. Token-based splitters (TokenTextSplitter, or RecursiveCharacterTextSplitter.from_tiktoken_encoder), preferred when token limits matter

2. Semantic splitters that split based on meaning (SemanticChunker, LLM-based chunkers), used for long unstructured documents

3. Markdown or HTML-aware splitters (MarkdownHeaderTextSplitter, HTMLHeaderTextSplitter), used for documents with hierarchical structure

4. Sentence-level splitters (NLTKTextSplitter, SpacyTextSplitter), used when chunking should preserve grammatical units

5. Fixed-size splitters (CharacterTextSplitter, non-recursive), simple but crude, used when input is already clean or uniform

In [30]:
from langchain_text_splitters import Language

PYTHON_CODE = """
  def hello_world():
  print("Hello, World!")
  # Call the function
  hello_world()
"""

python_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=50,
    chunk_overlap=0
)

python_docs = python_splitter.create_documents([PYTHON_CODE])

In [31]:
python_docs

[Document(metadata={}, page_content='def hello_world():\n  print("Hello, World!")'),
 Document(metadata={}, page_content='# Call the function\n  hello_world()')]

In [32]:
markdown_text = """
# LangChain
⚡ Building applications with LLMs through composability ⚡
## Quick Install
```bash
pip install langchain
```
As an open source project in a rapidly developing field, we are extremely open
 to contributions.
"""
md_splitter = RecursiveCharacterTextSplitter.from_language(
 language=Language.MARKDOWN, chunk_size=60, chunk_overlap=0
)
md_docs = md_splitter.create_documents([markdown_text], [{"source": "https://www.langchain.com"}])

In [33]:
md_docs

[Document(metadata={'source': 'https://www.langchain.com'}, page_content='# LangChain'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='⚡ Building applications with LLMs through composability ⚡'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='## Quick Install\n```bash\npip install langchain'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='```'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='As an open source project in a rapidly developing field, we'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='are extremely open'),
 Document(metadata={'source': 'https://www.langchain.com'}, page_content='to contributions.')]

### Generating text embeddings

In [49]:
from langchain_openai import OpenAIEmbeddings

model = OpenAIEmbeddings(model="text-embedding-3-large")

embeddings = model.embed_documents([
    "Hi There",
    "My name is Hemant",
    "What's your name?",
    "Nice to meet you!"
])

In [50]:
len(embeddings), len(embeddings[0])

(4, 3072)

In [45]:
# Let's try a smaller model and see if embedding dimension changes

emb = OpenAIEmbeddings(model="text-embedding-3-small")

embeddings2 = emb.embed_documents([
    "Hi There",
    "My name is Hemant",
    "What's your name?",
    "Nice to meet you!"
])

In [46]:
len(embeddings2), len(embeddings2[0])

(4, 1536)

In [52]:
# Let's try to embed a long sentence

emb_long = emb.embed_query('A TXT file is\xa0a type of file that stores plain text without any special formatting, styling, or complex media.\xa0It is a simple, universal format that can be opened on almost any device using a basic text editor like Notepad on Windows or TextEdit on Mac.\xa0These files are used for storing simple text documents, source code, and configuration data.\n\nKey characteristics •\tPlain text:\xa0Contains only characters, not complex formatting like bold, italics, or different fonts. •\tSimple:\xa0Easy to create and edit on a wide variety of operating systems and devices. •\tUniversal:\xa0Highly compatible across different software and hardware. •\tSmall size:\xa0The lack of formatting results in smaller file sizes compared to rich text documents. •\tVersatile:\xa0Can store anything from a simple note to complex data like computer code or scripts. How to create and use a TXT file •\tCreate:\xa0Use a simple text editor application on your computer.\xa0For example, on a Mac, you can set TextEdit to default to plain text, or on a Chromebook, use the built-in Text app. •\tSave:\xa0Save the file with the\xa0.txt\xa0extension. •\tOpen:\xa0Double-click the file to open it in your default text editor, or right-click and choose "Open with..." to select another program, such as Microsoft Word. •\tConvert:\xa0You can also open and convert a .txt file into other formats, like a PDF, using tools like\xa0Adobe Acrobat online services.')

In [54]:
len(emb_long)

1536

In [55]:
emb_q = emb.embed_query("Hi")
emb_d = emb.embed_documents(["Hi"])

In [59]:
len(emb_q), len(emb_d[0])

(1536, 1536)

In [61]:
emb_q

[-0.006926344707608223,
 -0.035333964973688126,
 0.0015698851784691215,
 0.0653456598520279,
 0.032968513667583466,
 -0.024186765775084496,
 -0.026138264685869217,
 0.0494379848241806,
 0.01620335876941681,
 -0.05162603035569191,
 -0.01343134231865406,
 -0.014554932713508606,
 -0.0260199923068285,
 -0.0032765232026576996,
 0.024556368589401245,
 0.0011457666987553239,
 -0.053488824516534805,
 0.015079768374562263,
 0.011457666754722595,
 0.0339442640542984,
 0.049319714307785034,
 0.02037247084081173,
 -0.013970961794257164,
 0.018864493817090988,
 0.017179109156131744,
 0.024260686710476875,
 0.018258346244692802,
 -0.0012076750863343477,
 0.019588913768529892,
 -0.036812376230955124,
 0.02767580933868885,
 -0.028237605467438698,
 0.027616674080491066,
 -0.01627727970480919,
 -0.011723781004548073,
 -0.015966814011335373,
 -0.014118802733719349,
 0.037551578134298325,
 0.018849710002541542,
 -0.037699420005083084,
 0.04346521571278572,
 -0.01241863239556551,
 0.020934266969561577,
 0.

In [62]:
emb_d[0]

[-0.0069594960659742355,
 -0.035274259746074677,
 0.0015957315918058157,
 0.06534460932016373,
 0.03293841332197189,
 -0.024201158434152603,
 -0.02610827423632145,
 0.04937804862856865,
 0.01623266376554966,
 -0.05168433114886284,
 -0.013357206247746944,
 -0.014599049463868141,
 -0.026019571349024773,
 -0.003257990349084139,
 0.024585537612438202,
 0.001171619864180684,
 -0.05345839262008667,
 0.015057348646223545,
 0.011487049050629139,
 0.03394371271133423,
 0.04934848099946976,
 0.020372141152620316,
 -0.01396334357559681,
 0.01887897402048111,
 0.017149262130260468,
 0.024156806990504265,
 0.01827283576130867,
 -0.0011956436792388558,
 0.01955902948975563,
 -0.03678221255540848,
 0.027675362303853035,
 -0.028207581490278244,
 0.027645794674754143,
 -0.01623266376554966,
 -0.011716199107468128,
 -0.01604047417640686,
 -0.01407422311604023,
 0.03758053854107857,
 0.01887897402048111,
 -0.037698812782764435,
 0.04343494400382042,
 -0.012411040253937244,
 0.020948711782693863,
 0.01355

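Q: How are document vs query embeddings different internally?

A: It depends on the model. Some embedding models are trained asymmetrically and expect a different prefix or task type for queries than for indexed documents. In those models:

Query embeddings emphasize what the user is asking.
Document embeddings emphasize what the document is about.

OpenAI's embedding models make no such distinction: both calls go to the same endpoint with the same model, and as far as I can tell embed_query() in langchain-openai simply embeds a one-item list and returns the single vector.

embed_documents() → takes a list of strings, returns a list of vectors

embed_query() → takes one string, returns one vector

The small differences between emb_q and emb_d[0] above (around the fifth decimal place) are most likely numerical noise between two API calls, not a different embedding head.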
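
One way to check how close the two printed vectors really are is cosine similarity. The sketch below uses only the first five components of emb_q and emb_d[0] as shown in the outputs above, so the result is indicative rather than the similarity of the full 1536-dimensional vectors; `cosine_similarity` is a small helper written for this note.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# First five components of emb_q and emb_d[0] as printed above (truncated).
q_head = [-0.006926344707608223, -0.035333964973688126, 0.0015698851784691215,
          0.0653456598520279, 0.032968513667583466]
d_head = [-0.0069594960659742355, -0.035274259746074677, 0.0015957315918058157,
          0.06534460932016373, 0.03293841332197189]

similarity = cosine_similarity(q_head, d_head)
print(round(similarity, 4))
```

A value very close to 1 means the two vectors point in essentially the same direction, so for retrieval purposes they are interchangeable.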



In [63]:
# Now let's load, split and embed a document

doc = TextLoader('/content/PlainText.txt').load()
splitted_docs = RecursiveCharacterTextSplitter(chunk_size=100, chunk_overlap=20).split_documents(doc)
model = OpenAIEmbeddings()
embeddings = model.embed_documents([doc.page_content for doc in splitted_docs])

In [66]:
len(splitted_docs), len(embeddings), len(embeddings[0])

(23, 23, 1536)