In [1]:
from langchain_core.documents import Document

doc=Document(
    page_content="this is my content",
    metadata={'source':'pj'},
    id="1"  # Document.id is typed as an optional string
)

doc.page_content

'this is my content'

In [3]:
### loading data from the pdf

from langchain_community.document_loaders import PyPDFLoader

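# Raw string so the backslashes in the Windows path are not read as escape sequences
loader=PyPDFLoader(
    file_path=r"..\documents\medical_book.pdf"
)

# load() returns one Document per PDF page
document=loader.load()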

In [4]:
print(len(document))

759


In [5]:
print(document[4].page_content)

The Gale Encyclopedia of Medicine 2is a medical ref-
erence product designed to inform and educate readers
about a wide variety of disorders, conditions, treatments,
and diagnostic tests. The Gale Group believes the product
to be comprehensive, but not necessarily definitive. It is
intended to supplement, not replace, consultation with a
physician or other healthcare practitioner. While the Gale
Group has made substantial efforts to provide information
that is accurate, comprehensive, and up-to-date, the Gale
Group makes no representations or warranties of any
kind, including without limitation, warranties of mer-
chantability or fitness for a particular purpose, nor does it
guarantee the accuracy, comprehensiveness, or timeliness
of the information contained in this product. Readers
should be aware that the universe of medical knowledge
is constantly growing and changing, and that differences
of medical opinion exist among authorities. Readers are
also advised to seek professional dia

### Step 2: chunking the dataset

In [10]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=150
)

chunks = splitter.split_documents(document)
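
To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal sketch of plain fixed-window chunking. It is a simplification and not what `RecursiveCharacterTextSplitter` does internally: the real splitter prefers to break on separators such as paragraphs and newlines. The function name `sliding_window_chunks` and the toy input are illustrative assumptions.

```python
def sliding_window_chunks(text: str, chunk_size: int = 800, chunk_overlap: int = 150) -> list[str]:
    """Cut text into windows of chunk_size characters.

    Consecutive windows share chunk_overlap characters, so a sentence that
    straddles a boundary appears whole in at least one window.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    windows = []
    for start in range(0, len(text), step):
        windows.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return windows


# Toy input: 2000 characters of repeating digits (hypothetical data)
sample = "".join(str(i % 10) for i in range(2000))
demo_chunks = sliding_window_chunks(sample)

print(len(demo_chunks))
print([len(c) for c in demo_chunks])
```

With a step of 650 characters (800 minus 150), the tail of each window is repeated at the head of the next one. This is also why the number of chunks is larger than the text length divided by `chunk_size`.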

In [7]:
chunks[1].metadata

{'producer': 'GPL Ghostscript 9.10',
 'creator': '',
 'creationdate': '2017-05-01T10:37:35-07:00',
 'moddate': '2017-05-01T10:37:35-07:00',
 'title': '',
 'author': '',
 'subject': '',
 'keywords': '',
 'source': '..\\documents\\medical_book.pdf',
 'total_pages': 759,
 'page': 1,
 'page_label': '2'}

In [11]:
len(chunks)

4986

In [15]:
print(chunks[2500].page_content)

Infection of the upper urinary tract involves the spread of
bacteria to the kidney and is called pyelonephritis.
Description
The frequency of bladder infections in humans varies
significantly according to age and sex. The male/female
GALE ENCYCLOPEDIA OF MEDICINE 2 991
Cystitis


In [16]:
## creating embedding model

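# The langchain_community version of HuggingFaceEmbeddings is deprecated; the maintained
# class lives in the langchain-huggingface package (pip install langchain-huggingface)
from langchain_huggingface import HuggingFaceEmbeddings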

embedding_model=HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)



In [17]:
## storing the document in vector db

from langchain_community.vectorstores import FAISS

vector_db=FAISS.from_documents(
    documents=chunks,
    embedding=embedding_model
)

vector_db.save_local(folder_path="../vectorDB",index_name="Faiss_index")

In [18]:
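## retrieval pipeline

# load_local unpickles the saved docstore, so allow_dangerous_deserialization=True
# should only be used with index files you created yourself
vector_db = FAISS.load_local(
    folder_path="../vectorDB",
    index_name="Faiss_index",
    embeddings=embedding_model,
    allow_dangerous_deserialization=True
)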


In [19]:
retriever = vector_db.as_retriever(
    search_kwargs={"k": 3}
)
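
Conceptually, a retriever with `k=3` embeds the query and returns the three stored chunks whose vectors are closest to it. The brute-force sketch below shows that idea on toy 2-D vectors, assuming Euclidean (L2) distance as the metric. The function name `top_k_by_l2` and the vectors are made up for illustration; FAISS performs the same kind of search far more efficiently on the real 384-dimensional MiniLM embeddings.

```python
import math


def top_k_by_l2(query_vec, doc_vecs, k=3):
    """Return the indices of the k vectors closest to query_vec (brute force)."""
    distances = [
        (math.dist(query_vec, vec), idx) for idx, vec in enumerate(doc_vecs)
    ]
    distances.sort()  # smallest distance first
    return [idx for _, idx in distances[:k]]


# Hypothetical "embeddings" for four chunks
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]]
nearest = top_k_by_l2([0.9, 0.1], toy_vectors, k=3)

print(nearest)
```

The returned indices are ordered from most to least similar, which matches the order in which the retrieved chunks are later joined into the context.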


## building LLM 


In [None]:
## using groq llama3 model

# from langchain_groq import ChatGroq
# from dotenv import load_dotenv
# import os

# load_dotenv()

# llm = ChatGroq(
#     groq_api_key=os.getenv("GROQ_API_KEY"),
#     model_name="llama3-70b-8192"
# )

In [20]:
import os
from langchain_google_genai import ChatGoogleGenerativeAI
from dotenv import load_dotenv

load_dotenv()

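# Read the key once and fail early with a clear message if it is missing
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set; add it to your .env file")

model = ChatGoogleGenerativeAI(model="gemini-2.5-flash-lite", google_api_key=api_key)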

In [28]:
query="what is Cholecystectomy"

docs = retriever.invoke(query)

context="\n".join([doc.page_content for doc in docs])



In [29]:
## testing the RAG model
from langchain_core.messages import SystemMessage,HumanMessage

# The system message carries the instructions plus the retrieved context;
# the human message carries the user's question
message=[
    SystemMessage(content="You are a professional medical doctor. Answer using only the provided context.\n\n"+context),
    HumanMessage(content=query)
]


In [30]:
response = model.invoke(message)

print(response.content)

A cholecystectomy is the surgical removal of the gallbladder.
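
The retrieve, build-context, and invoke steps above can be folded into one reusable function. This is a minimal sketch, not a definitive implementation: the names `answer` and `SYSTEM_PROMPT` are hypothetical, and the function only assumes that `retriever` and `model` expose an `invoke` method the way the objects in this notebook do. The messages are passed as `(role, content)` tuples, a form LangChain chat models accept in place of message objects.

```python
# Hypothetical prompt text, paraphrasing the system message used above
SYSTEM_PROMPT = (
    "You are a professional medical doctor. "
    "Answer using only the provided context.\n\n"
)


def answer(query, retriever, model):
    """Retrieve context for the query and ask the chat model to answer from it."""
    docs = retriever.invoke(query)
    context = "\n".join(d.page_content for d in docs)
    messages = [
        ("system", SYSTEM_PROMPT + context),  # instructions + retrieved chunks
        ("human", query),                     # the user's question
    ]
    return model.invoke(messages).content
```

With the objects defined earlier, usage would look like `print(answer("what is Cholecystectomy", retriever, model))`.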
