In [5]:
###Document Data-Structure

from langchain_core.documents import Document


In [6]:
doc = Document(
    page_content="this is the main text to create a RAG",
    metadata={
        "source": "xyz",
        "pages": 1,
        "author": "kushagra",
        "date_created": "2026-01-01",
    },
)
doc

Document(metadata={'source': 'xyz', 'pages': 1, 'author': 'kushagra', 'date_created': '2026-01-01'}, page_content='this is the main text to create a RAG')

## MANUAL LOADER

The manual loader below flattens each row's structured fields into a single dense line with | separators.

That erases the boundaries between fields. Embedding models are trained on natural language, so a run of key:value|key:value pairs gives them little structure to work with.

"Index:1|Name:Thermostat Drone Heater|Description:Consumer approach..." embeds poorly.

### Use a manual loader only if:

- the official loader cannot extract text correctly at all, or

- you need non-standard parsing (weird formats, mixed encodings, broken structure)
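
To make the difference concrete, here is a minimal sketch that renders one row both ways. It uses only the standard library; the one-row CSV string is a made-up sample that mirrors a few columns of sample.csv, and the variable names are illustrative.

```python
import csv
import io

# Made-up one-row CSV mirroring a few columns of sample.csv (assumption).
raw = "Index,Name,Category,Price\n1,Thermostat Drone Heater,Kitchen Appliances,74\n"
row = next(csv.DictReader(io.StringIO(raw)))

# Dense form: every field glued onto one line with "|".
flat = "|".join(f"{col}:{val}" for col, val in row.items())

# Readable form: one "key: value" pair per line.
readable = "\n".join(f"{col}: {val}" for col, val in row.items())

print(flat)
print("---")
print(readable)
```

The second form keeps each field on its own line, so a text splitter that breaks on newlines can cut between fields instead of in the middle of one.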

In [7]:

import pandas as pd
df = pd.read_csv("../data/sample.csv")
documents=[]

for idx, row in df.iterrows():
    text = "|".join(f"{col}:{row[col]}" for col in df.columns )
    
    doc = Document(
        page_content= text,
        metadata={
            "source": "sample.csv",
            "row":idx
        }
    )

    documents.append(doc)
len(documents), documents[0]

(1000,
 Document(metadata={'source': 'sample.csv', 'row': 0}, page_content='Index:1|Name:Thermostat Drone Heater|Description:Consumer approach woman us those star.|Brand:Bradford-Yu|Category:Kitchen Appliances|Price:74|Currency:USD|Stock:139|EAN:8619793560985|Color:Orchid|Size:Medium|Availability:backorder|Internal ID:38'))

In [8]:
documents[0].page_content
documents[0].metadata

{'source': 'sample.csv', 'row': 0}

# LANGCHAIN LOADER

With CSVLoader, each field sits on its own line as "key: value".

Newlines preserve the boundaries between fields.

Metadata is clean and minimal (just source and row).

This format should chunk and retrieve better, which in turn tends to reduce hallucinations.


That said, a manual loader is not inherently bad. With a better script it works well, and it lets you shape the metadata tags to carry more relevant information forward.

# extra

If it’s useful for meaning, it belongs in page_content.

If it’s useful for control or tracing, it belongs in metadata.


A better loader does two independent things well:

1. It produces clean, human-readable page_content.

2. It produces minimal, accurate metadata.
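
A rough sketch of such a loader step, under stated assumptions: which columns count as "meaning" and which as "control or tracing" is my own guess for this dataset, `row_to_record` is a hypothetical helper, and a plain dict stands in for a `Document` so the snippet runs without LangChain installed.

```python
# Assumed split of columns (not from the notebook): adjust for your data.
CONTENT_FIELDS = ["Name", "Description", "Brand", "Category", "Color", "Size"]
METADATA_FIELDS = ["Internal ID", "Availability"]


def row_to_record(row: dict, source: str, row_index: int) -> dict:
    """Build readable page_content and minimal metadata from one CSV row."""
    # Meaning-bearing fields go into the text, one "key: value" per line.
    page_content = "\n".join(
        f"{key}: {row[key]}" for key in CONTENT_FIELDS if key in row
    )
    # Tracing/control fields go into metadata, next to source and row index.
    metadata = {"source": source, "row": row_index}
    metadata.update({key: row[key] for key in METADATA_FIELDS if key in row})
    return {"page_content": page_content, "metadata": metadata}


# Values copied from the first row shown in the output above.
row = {
    "Index": "1",
    "Name": "Thermostat Drone Heater",
    "Description": "Consumer approach woman us those star.",
    "Brand": "Bradford-Yu",
    "Category": "Kitchen Appliances",
    "Price": "74",
    "Availability": "backorder",
    "Internal ID": "38",
}

record = row_to_record(row, source="sample.csv", row_index=0)
print(record["page_content"])
print(record["metadata"])
```

In the notebook itself, the returned dict would map directly onto `Document(page_content=..., metadata=...)`.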

In [9]:

from langchain_community.document_loaders import CSVLoader

loader = CSVLoader(file_path="../data/sample.csv")
docs = loader.load()

print(len(docs))
print(docs[0].page_content)
print(docs[0].metadata)

1000
Index: 1
Name: Thermostat Drone Heater
Description: Consumer approach woman us those star.
Brand: Bradford-Yu
Category: Kitchen Appliances
Price: 74
Currency: USD
Stock: 139
EAN: 8619793560985
Color: Orchid
Size: Medium
Availability: backorder
Internal ID: 38
{'source': '../data/sample.csv', 'row': 0}


# Now working with some PDFs


In [10]:
from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(file_path="../data/sample2.pdf")
docss=loader.load()

print(len(docss))
print(docss[0].page_content)
print(docss[0].metadata)

2
Mumma's Kitchen
Twister
Sandwich  &
Hot Dog
Cheese Chilli 
Peri Peri 
Schezwan 
Lemon Chilli 
Italian 
Tandoori Masala
Cheese Chutney
Cheese Chilli 
Chatpata Indori
Cheese Corn 
Paneer Schezwan
Cheese Burst
Chatpata Hot Dog 
Veg Aalu Tikki Hot Dog 
Paneer Tikka Hot Dog 
₹ 85
₹ 95
₹ 115
₹ 105
₹ 125
₹ 135
₹ 129
₹ 135
₹ 129
₹ 89
₹ 99
₹ 99
₹ 99
₹ 89
₹ 89
Pizzas
Pasta
Go-To 
Indie-Mexican 
Oh-Cheese! 
Desi Chirpira 
Toofani Mexican
Crunchy Kurkure 
Peri-Peri Spicy
Pro-Max Cheese 
Paneer Shaukeen
Cheesy Fries Supreme
Sab Par Bhari 
Pasta Arrabiata 
(Penne pasta tossed in authentic
red sauce)
Pasta Alfredo 
(Penne pasta tossed in creamy
white sauce)
Baked Cheesy Pasta
(Arrabiata pasta baked to
perfection with extra cheese)
Baked Alfredo 
Green Wave 
(Capsicum, Jalapeno, Onion) 
Farm Fresh 
(The evergreen combination of Onion and Capsicum) 
Margherita 
(The classic pizza sauce and mozzarella cheese) 
Corn Feast 
(Golden Corn and lots of cheese) 
Veggie Blast 
(Capsicum, Onion, Golden Corn, O

In [11]:
###MAIN PDF WE ARE USING GOING AHEAD 



from langchain_community.document_loaders import PyMuPDFLoader

loader = PyMuPDFLoader(file_path="../data/sample3.pdf")
docsnew=loader.load()

print(len(docsnew))
print(docsnew[10].page_content)
print(docsnew[10].metadata)

88
The Audiovisual 
In March of 1995, a limousine carrying Ted Koppel, the host of ABC-TV's “Nightline” pulled up to the 
snow-covered curb outside Morrie's house in West Newton, Massachusetts. 
Morrie was in a wheelchair full-time now, getting used to helpers lifting him like a heavy sack from the 
chair to the bed and the bed to the chair. He had begun to cough while eating, and chewing was a chore. 
His legs were dead; he would never walk again. 
 
Yet he refused to be depressed. Instead, Morrie had become a lightning rod of ideas. He jotted down his 
thoughts on yellow pads, envelopes, folders, scrap paper. He wrote bite-sized philosophies about living 
with death's shadow: “Accept what you are able to do and what 
you are not able to do”; “Accept the past as past, without denying it or discarding it”; “Learn to forgive 
yourself and to forgive others”; “Don't assume that it's too late to get involved.” 
After a while, he had more than fifty of these “aphorisms,” which he shared wi

# Chunking

In [12]:
### Text splitting: break the documents into chunks
from langchain_text_splitters import RecursiveCharacterTextSplitter
def split_documents(docsnew,chunk_size=600,chunk_overlap=100):
    text_splitter= RecursiveCharacterTextSplitter(
        chunk_overlap=chunk_overlap,
        chunk_size=chunk_size,
        length_function=len,
        separators=["\n\n","\n"," ",""]
    )
    split_docs = text_splitter.split_documents(docsnew)
    print(f"Split{len(docsnew)} documents into {len(split_docs)} chunks")

    if split_docs:
        print(f"\n Example chunk:")
        print(f"Content: {split_docs[0].page_content[:200]}....")
        print(f"Metadata: {split_docs[0].metadata}")
    return split_docs

In [13]:
chunks = split_documents(docsnew)

Split88 documents into 413 chunks

 Example chunk:
Content: TUESDAYS  
WITH  
MORRIE: 
 
AN OLD MAN,  
A YOUNG MAN,  
AND  
LIFE'S GREATEST LESSON 
 
 
 
 Mitch Albom....
Metadata: {'producer': 'GPL Ghostscript 8.15', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': 'D:20080607104704', 'source': '../data/sample3.pdf', 'file_path': '../data/sample3.pdf', 'total_pages': 88, 'format': 'PDF 1.4', 'title': 'Microsoft Word - tuedays with morrie', 'author': 'dixon', 'subject': '', 'keywords': '', 'moddate': 'D:20080607104704', 'trapped': '', 'modDate': 'D:20080607104704', 'creationDate': 'D:20080607104704', 'page': 1}
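
The chunk_size and chunk_overlap arguments used above can be hard to picture. The following is a deliberately simplified sketch of a fixed-window splitter. It is not how RecursiveCharacterTextSplitter works internally (that splitter also tries to break on the listed separators before falling back to raw characters); `split_fixed` is a hypothetical helper that only illustrates what size and overlap mean.

```python
def split_fixed(text: str, chunk_size: int = 600, chunk_overlap: int = 100) -> list:
    """Cut text into windows of chunk_size characters that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each new window advances
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return pieces


# Synthetic 1400-character text so the window boundaries are easy to check.
sample_text = "".join(str(i % 10) for i in range(1400))
pieces = split_fixed(sample_text, chunk_size=600, chunk_overlap=100)

print([len(p) for p in pieces])
print(pieces[0][-100:] == pieces[1][:100])
```

The last 100 characters of one piece are repeated at the start of the next, which is what keeps a sentence that straddles a boundary retrievable from either side.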


In [14]:
# Cleaning step before chunking: drop short pages (title page, credits, etc.) by keeping only pages with more than 300 characters of text

clean_docs = [
    d for d in docsnew
    if len(d.page_content.strip()) > 300
]


new_chunks = split_documents(clean_docs)

Split85 documents into 411 chunks

 Example chunk:
Content: Acknowledgments 
I would like to acknowledge the enormous help given to me in creating this book. For their memories, 
their patience, and their guidance, I wish to thank Charlotte, Rob, and Jonathan ....
Metadata: {'producer': 'GPL Ghostscript 8.15', 'creator': 'PScript5.dll Version 5.2.2', 'creationdate': 'D:20080607104704', 'source': '../data/sample3.pdf', 'file_path': '../data/sample3.pdf', 'total_pages': 88, 'format': 'PDF 1.4', 'title': 'Microsoft Word - tuedays with morrie', 'author': 'dixon', 'subject': '', 'keywords': '', 'moddate': 'D:20080607104704', 'trapped': '', 'modDate': 'D:20080607104704', 'creationDate': 'D:20080607104704', 'page': 2}


# extra info 

We can also add manual docs for convenience. In our case we took the book "Tuesdays with Morrie"
and left out pages like the index and title page: any "docs" with 300 characters or fewer (the filter checks len(page_content), so the threshold counts characters, not words).
As a result, the author's name was excluded from our database, because it appears on the title page, which did not meet the length constraint for clean_docs.
This may lead the model to hallucinate if it is asked about the author.
In that case we can create a manual doc like:


page_content = "The book 'Tuesdays with Morrie' is written by Mitch Albom."
metadata = {"type": "document_info"}
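
A minimal sketch of that idea, assuming a tiny `Doc` dataclass as a stand-in for LangChain's `Document` (so it runs without the library) and two made-up pages in place of the real PDF:

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for langchain_core.documents.Document (assumption for this sketch)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


MIN_CHARS = 300  # same threshold as the filter above, counted in characters

# Made-up pages: a short title page and a longer body page.
pages = [
    Doc("TUESDAYS WITH MORRIE\nMitch Albom", {"page": 0}),
    Doc("Morrie was in a wheelchair full-time now. " * 10, {"page": 10}),
]

# The title page is dropped here, and the author's name goes with it.
clean_docs = [d for d in pages if len(d.page_content.strip()) > MIN_CHARS]

# Add the missing fact back as a hand-written document.
info_doc = Doc(
    page_content="The book 'Tuesdays with Morrie' is written by Mitch Albom.",
    metadata={"type": "document_info"},
)
clean_docs.append(info_doc)

print(len(clean_docs))
print(clean_docs[-1])
```

Because the manual doc is far shorter than the chunk size, it should pass through the splitter as a single chunk, so it can be appended either before or after splitting.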


-----------------------------------------------------------------------------------------------


# Embeddings 


In [None]:
#### %pip install sentence-transformers ###########

# The venv wasn't active, so I had to install the package again this way.
# Always activate the venv first.


Note: you may need to restart the kernel to use updated packages.





In [None]:
import numpy as np
from sentence_transformers import SentenceTransformer
import chromadb
from chromadb.config import Settings
import uuid
from typing import List,Dict,Any,Tuple
from sklearn.metrics.pairwise import cosine_similarity

In [21]:
# model = SentenceTransformer("all-MiniLM-L6-v2")

# texts = [c.page_content for c in chunks]

# embeddings = model.encode(texts, show_progress_bar=True)
# What encode does here, for each chunk text:
#     run it through the transformer
#     compress its meaning into 384 numbers

# print(embeddings.shape)

# Shape is [number_of_chunks, embedding_dimension]
#   → (413, 384)



413 = how many chunks you fed in

384 = how many semantic features the model uses

## EXAMPLE:

### STEP 1: EMBEDDING OUTPUT SHAPE

Assume we have 3 chunks:

C1: "Morrie teaches lessons about life and death."

C2: "The book discusses love, forgiveness, and acceptance."

C3: "A recipe for making pasta with cheese."


Suppose that embedding these 3 chunks gives each one a vector of 7 semantic features (the real model above uses 384).

### STEP 2: WHAT THOSE 7 FEATURES ARE

Let’s pretend the 7 dimensions loosely correspond to:

1. life / philosophy

2. emotions / relationships

3. death / mortality

4. food / cooking

5. actions / processes

6. positivity

7. negativity

### STEP 3: EXAMPLE EMBEDDINGS (FAKE NUMBERS, REAL IDEA)

C1 → [0.9, 0.6, 0.8, 0.0, 0.2, 0.5, 0.1]

C2 → [0.8, 0.9, 0.3, 0.0, 0.1, 0.7, 0.0]

C3 → [0.0, 0.1, 0.0, 0.9, 0.6, 0.2, 0.1]


For each chunk, a higher number means the chunk is more strongly associated with that feature.

For C1 the highest value is 0.9, on the 1st feature (life / philosophy), which matches what the chunk is about.

Some might argue for the 3rd feature (death / mortality) at 0.8, but the idea is clear. In a real model the individual dimensions are not labeled or interpretable like this; what matters is how whole vectors compare.
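
The reading above can be checked mechanically. This small sketch uses the same pretend feature names and fake vectors; `strongest_feature` is a hypothetical helper for illustration only.

```python
# Pretend feature names from Step 2 (not real model dimensions).
FEATURES = [
    "life / philosophy",
    "emotions / relationships",
    "death / mortality",
    "food / cooking",
    "actions / processes",
    "positivity",
    "negativity",
]

# Fake embeddings from Step 3.
toy_vectors = {
    "C1": [0.9, 0.6, 0.8, 0.0, 0.2, 0.5, 0.1],
    "C2": [0.8, 0.9, 0.3, 0.0, 0.1, 0.7, 0.0],
    "C3": [0.0, 0.1, 0.0, 0.9, 0.6, 0.2, 0.1],
}


def strongest_feature(vector):
    """Return the name of the feature with the largest value."""
    index = max(range(len(vector)), key=lambda i: vector[i])
    return FEATURES[index]


top = {name: strongest_feature(vec) for name, vec in toy_vectors.items()}
for name, feature in top.items():
    print(name, "->", feature)
```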

### STEP 4: WHY SIMILARITY SEARCH WORKS 

if you ask:

"What lessons does Morrie teach about life?"

That question becomes another 7-dim vector, say:

Q → [0.85, 0.7, 0.6, 0.0, 0.2, 0.6, 0.1]

The question is now a vector in the same space as the chunks, and it is compared with each chunk across all 7 dimensions (for example by cosine similarity), not just on its largest value.

Its profile is close to C1 and C2 and far from C3.

Looking at the question logically from our point of view, it makes sense to retrieve C1 and C2 to answer it,

because they are closely related to it and cover most of what it asks. Add an LLM on top, and the retrieved text is rephrased

into an answer presented to the user.
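
As a sketch of that comparison, the snippet below ranks the three fake chunk vectors against Q by cosine similarity. It uses plain `math` to stay self-contained; the notebook imports scikit-learn's `cosine_similarity`, which computes the same measure on arrays. The helper name `cosine` is my own.

```python
import math


def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Fake embeddings from Step 3 and the fake query vector from Step 4.
chunk_vectors = {
    "C1": [0.9, 0.6, 0.8, 0.0, 0.2, 0.5, 0.1],
    "C2": [0.8, 0.9, 0.3, 0.0, 0.1, 0.7, 0.0],
    "C3": [0.0, 0.1, 0.0, 0.9, 0.6, 0.2, 0.1],
}
query = [0.85, 0.7, 0.6, 0.0, 0.2, 0.6, 0.1]

scores = {name: cosine(query, vec) for name, vec in chunk_vectors.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(f"{name}: {scores[name]:.3f}")
```

With these numbers C1 and C2 both score above 0.95 while C3 lands near 0.2, so a top-2 retrieval would return C1 and C2.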

In [None]:

class EmbeddingManager:
    """Handles document embedding generation using SentenceTransformers"""
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        """
        Initialize the embedding manager.

        Args:
            model_name: HuggingFace model name for sentence embeddings
        """
        self.model_name = model_name
        self.model = None
        # The leading underscore marks _load_model as internal by convention;
        # Python does not enforce this.
        self._load_model()

    def _load_model(self):
        """Load the SentenceTransformer model"""
        try:
            print(f"Loading embedding model:{self.model_name}")
            self.model = SentenceTransformer(self.model_name)
            print(f"Model loaded successfully, Embedding dimension:{self.model.get_sentence_embedding_dimension()}")
        except Exception as e:
            print(f"Error Loading model:{self.model_name}: {e}")
            raise

    def generate_embeddings(self, text: List[str]) -> np.ndarray:
        """Encode a list of texts into an array of shape (len(text), embedding_dim)."""
        if not self.model:
            raise ValueError("Model not loaded")

        print(f"Generating embeddings for {len(text)} texts...")
        embeddings = self.model.encode(text, show_progress_bar=True)
        print(f"Generated embeddings with shape:{embeddings.shape}")
        return embeddings
    


### Initialize the embedding manager

embedding_manager= EmbeddingManager()
embedding_manager


# Embedding size is fixed inside the model architecture, not decided at runtime.

# all-MiniLM-L6-v2 was trained to output vectors of length 384.



Loading embedding model:all-MiniLM-L6-v2



Model loaded successfully, Embedding dimension:384


<__main__.EmbeddingManager at 0x23daf4e55e0>