In [26]:
### Document data structure

from langchain_core.documents import Document


In [27]:
doc = Document(
    page_content="this is the main text to create a RAG",
    metadata={
        "source": "xyz",
        "pages": 1,
        "author": "kushagra",
        "date_created": "2026-01-01",
    },
)
doc

Document(metadata={'source': 'xyz', 'pages': 1, 'author': 'kushagra', 'date_created': '2026-01-01'}, page_content='this is the main text to create a RAG')

## MANUAL LOADER

The manual loader below flattens each row's structured fields into a single dense line with | separators.

That kills semantic boundaries. Embeddings don’t understand “fields”; they understand language structure, so a delimiter-packed record gives them little to work with.

A string like "Index:1|Name:Thermostat Drone Heater|Description:Consumer approach..." therefore embeds poorly.

### Use a manual loader only if:

1. The official loader cannot extract the text correctly at all, or

2. You need non-standard parsing (weird formats, mixed encodings, broken structure).
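
For the mixed-encodings case, a manual loader usually has to decode the raw bytes itself before it can build any documents. A minimal sketch of that step, assuming the files are either UTF-8 or Windows-1252. The function name decode_with_fallback and the list of encodings are illustrative assumptions, not part of LangChain:

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "cp1252")):
    """Return (text, encoding_used), trying each candidate encoding in order."""
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, but mark undecodable bytes with U+FFFD.
    return raw.decode("utf-8", errors="replace"), "utf-8 (lossy)"


# A byte string that is valid Windows-1252 but not valid UTF-8.
text, used = decode_with_fallback("café".encode("cp1252"))
print(text, used)
```

Recording the encoding that worked (for example in metadata) makes it easier to trace odd characters back to their source file later.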

In [28]:

import pandas as pd

df = pd.read_csv("../data/sample.csv")
documents = []

for idx, row in df.iterrows():
    # Flatten the whole row into one "col:value|col:value" line.
    # This is the format the notes above warn against.
    text = "|".join(f"{col}:{row[col]}" for col in df.columns)

    doc = Document(
        page_content=text,
        metadata={
            "source": "sample.csv",
            "row": idx,
        },
    )
    documents.append(doc)

len(documents), documents[0]

(1000,
 Document(metadata={'source': 'sample.csv', 'row': 0}, page_content='Index:1|Name:Thermostat Drone Heater|Description:Consumer approach woman us those star.|Brand:Bradford-Yu|Category:Kitchen Appliances|Price:74|Currency:USD|Stock:139|EAN:8619793560985|Color:Orchid|Size:Medium|Availability:backorder|Internal ID:38'))

In [29]:
documents[0].page_content
documents[0].metadata

{'source': 'sample.csv', 'row': 0}

## LANGCHAIN LOADER

Each field is clearly separated.

Newlines preserve semantic breaks.

Metadata is clean and minimal.

This format tends to chunk more cleanly and retrieve more accurately, which gives the model less reason to hallucinate.


That said, a manual loader is not inherently bad. With a better script it works well, and it lets you shape the metadata tags so that more relevant information is carried forward.

# extra

If it’s useful for meaning, it belongs in page_content.

If it’s useful for control or tracing, it belongs in metadata.


A better loader does two independent things well:

1. Produces clean, human-readable page_content.

2. Produces minimal, accurate metadata.
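
The two rules above can be sketched as a small row-to-document function. This is an illustration under stated assumptions, not the CSVLoader implementation: SimpleDoc is a stand-in for langchain_core's Document so the snippet runs with the standard library alone, and the choice of Index and Internal ID as tracing-only columns (TRACE_COLUMNS) is a guess about this dataset.

```python
import csv
import io
from dataclasses import dataclass, field


@dataclass
class SimpleDoc:
    """Stand-in for langchain_core.documents.Document (illustration only)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


# Assumption: these columns are identifiers, useful for tracing but not for meaning.
TRACE_COLUMNS = ("Index", "Internal ID")


def row_to_doc(row, row_number, source):
    # Meaningful fields become readable "Name: value" lines in page_content.
    lines = [f"{col}: {val}" for col, val in row.items() if col not in TRACE_COLUMNS]
    # Tracing fields go to metadata, next to the source and row number.
    metadata = {"source": source, "row": row_number}
    for col in TRACE_COLUMNS:
        if col in row:
            metadata[col] = row[col]
    return SimpleDoc(page_content="\n".join(lines), metadata=metadata)


# A cut-down version of the first row of sample.csv shown earlier.
sample_csv = (
    "Index,Name,Description,Brand,Internal ID\n"
    "1,Thermostat Drone Heater,Consumer approach woman us those star.,Bradford-Yu,38\n"
)
rows = csv.DictReader(io.StringIO(sample_csv))
sketch_docs = [row_to_doc(row, i, "sample.csv") for i, row in enumerate(rows)]

print(sketch_docs[0].page_content)
print(sketch_docs[0].metadata)
```

Keeping identifiers out of page_content means the embedding is built only from text that carries meaning, while the row can still be traced through its metadata.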

In [30]:

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path="../data/sample.csv")
docs = loader.load()

print(len(docs))
print(docs[0].page_content)
print(docs[0].metadata)

1000
Index: 1
Name: Thermostat Drone Heater
Description: Consumer approach woman us those star.
Brand: Bradford-Yu
Category: Kitchen Appliances
Price: 74
Currency: USD
Stock: 139
EAN: 8619793560985
Color: Orchid
Size: Medium
Availability: backorder
Internal ID: 38
{'source': '../data/sample.csv', 'row': 0}


In [33]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(file_path="../data/sample2.pdf")
docss = loader.load()

print(len(docss))
print(docss[0].page_content)
print(docss[0].metadata)

2
Mumma's Kitchen
Twister
Sandwich  &
Hot Dog
Cheese Chilli 
Peri Peri 
Schezwan 
Lemon Chilli 
Italian 
Tandoori Masala
Cheese Chutney
Cheese Chilli 
Chatpata Indori
Cheese Corn 
Paneer Schezwan
Cheese Burst
Chatpata Hot Dog 
Veg Aalu Tikki Hot Dog 
Paneer Tikka Hot Dog 
₹ 85
₹ 95
₹ 115
₹ 105
₹ 125
₹ 135
₹ 129
₹ 135
₹ 129
₹ 89
₹ 99
₹ 99
₹ 99
₹ 89
₹ 89
Pizzas
Pasta
Go-To 
Indie-Mexican 
Oh-Cheese! 
Desi Chirpira 
Toofani Mexican
Crunchy Kurkure 
Peri-Peri Spicy
Pro-Max Cheese 
Paneer Shaukeen
Cheesy Fries Supreme
Sab Par Bhari 
Pasta Arrabiata 
(Penne pasta tossed in authentic
red sauce)
Pasta Alfredo 
(Penne pasta tossed in creamy
white sauce)
Baked Cheesy Pasta
(Arrabiata pasta baked to
perfection with extra cheese)
Baked Alfredo 
Green Wave 
(Capsicum, Jalapeno, Onion) 
Farm Fresh 
(The evergreen combination of Onion and Capsicum) 
Margherita 
(The classic pizza sauce and mozzarella cheese) 
Corn Feast 
(Golden Corn and lots of cheese) 
Veggie Blast 
(Capsicum, Onion, Golden Corn, O

In [None]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(file_path="../data/sample3.pdf")
docsnew = loader.load()

print(len(docsnew))
print(docsnew[10].page_content)
print(docsnew[10].metadata)

88
The Audiovisual 
In March of 1995, a limousine carrying Ted Koppel, the host of ABC-TV's “Nightline” pulled up to the 
snow-covered curb outside Morrie's house in West Newton, Massachusetts. 
Morrie was in a wheelchair full-time now, getting used to helpers lifting him like a heavy sack from the 
chair to the bed and the bed to the chair. He had begun to cough while eating, and chewing was a chore. 
His legs were dead; he would never walk again. 
 
Yet he refused to be depressed. Instead, Morrie had become a lightning rod of ideas. He jotted down his 
thoughts on yellow pads, envelopes, folders, scrap paper. He wrote bite-sized philosophies about living 
with death's shadow: “Accept what you are able to do and what 
you are not able to do”; “Accept the past as past, without denying it or discarding it”; “Learn to forgive 
yourself and to forgive others”; “Don't assume that it's too late to get involved.” 
After a while, he had more than fifty of these “aphorisms,” which he shared wi
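
Pages this long are too large to embed as a single unit, so the next cell splits them into chunks. To make the chunk_size and chunk_overlap parameters concrete first, here is a minimal sketch of a plain sliding window over characters. It is a simplification and not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on separators); the function name sliding_window_chunks is hypothetical.

```python
def sliding_window_chunks(text, chunk_size=600, chunk_overlap=100):
    """Cut text into fixed-size pieces where neighbours share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks


# Synthetic 1300-character text so the chunk boundaries are easy to check.
text = "".join(str(i % 10) for i in range(1300))
chunks = sliding_window_chunks(text)
print([len(c) for c in chunks])
print(chunks[0][-100:] == chunks[1][:100])
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of storing some text twice.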

In [3]:
### Text splitting: break the documents into chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter


def split_documents(docsnew, chunk_size=600, chunk_overlap=100):
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
        length_function=len,
        # Try paragraph breaks first, then lines, then words, then single characters.
        separators=["\n\n", "\n", " ", ""],
    )
    split_docs = text_splitter.split_documents(docsnew)
    print(f"Split {len(docsnew)} documents into {len(split_docs)} chunks")