Step 1: Install Required Libraries

Libraries: langchain, transformers, sentence-transformers, faiss-cpu, and pypdf (the PDF parser that LangChain's PyPDFLoader relies on; pdfplumber is an alternative parser if you prefer it).

In [1]:
pip install transformers sentence-transformers faiss-cpu pypdf langchain

Note: you may need to restart the kernel to use updated packages.




Step 2: Load PDF with LangChain

We’ll use LangChain’s PyPDFLoader to ingest the source documents. I used 30 PDFs as the knowledge base. No model is trained on them; they are only indexed so that the retriever can search them.

In [2]:
import os

# Folder that holds the 30 source PDFs (RPDoc1.pdf ... RPDoc30.pdf)
pdf_dir = r"D:\Neeru\Python & DataScience\Live projects\LLM-powered Q&A assistant for retinal diseases\RP_pdfs"
pdf_paths = [os.path.join(pdf_dir, f"RPDoc{i}.pdf") for i in range(1, 31)]

from langchain.document_loaders import PyPDFLoader

documents = []

for path in pdf_paths:
    loader = PyPDFLoader(path)
    documents.extend(loader.load())


In [3]:
# from langchain.document_loaders import PyPDFLoader

# loader = PyPDFLoader("D:\\Neeru\\Python & DataScience\\Live projects\\RPDoc1.pdf")
# documents = loader.load()

You should now have a list of Document objects, one per PDF page, each holding the page text and its source metadata. PyPDFLoader handles the text extraction for you, but it does not split pages into smaller chunks.
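Because each loaded document is a whole page, a single page can be far longer than the 512-token input limit reported in the warnings at the end of this notebook. Splitting pages into smaller overlapping chunks before embedding is the usual remedy. LangChain ships its own splitters (for example RecursiveCharacterTextSplitter), which would be the natural choice here. The block below is only a plain-Python sketch of the idea, not LangChain's implementation; the function name split_text and the sizes 500 and 50 are illustrative assumptions.

```python
from typing import List


def split_text(text: str, chunk_size: int = 500, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


# Stand-in for the text of one PDF page (hypothetical content)
page_text = "Retinal disease research. " * 100
chunks = split_text(page_text, chunk_size=500, overlap=50)
print(len(chunks), "chunks; longest has", max(len(c) for c in chunks), "characters")
```

The overlap means that a sentence cut at a chunk boundary still appears intact in one of the two neighbouring chunks, at the cost of indexing some text twice.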

Step 3: Embed with Sentence Transformers

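Now let’s turn the loaded text into vectors with all-MiniLM-L6-v2, a Sentence Transformers model hosted on Hugging Face. It is small, free to run locally, and maps each text to a 384-dimensional vector, which is adequate for most Q&A setups. This cell only loads the model; the documents are embedded in Step 4.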

In [4]:
from langchain.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")



Step 4: Build Vector Store (FAISS)

FAISS is a library for fast similarity search over vectors. FAISS.from_documents embeds every loaded document with the model from Step 3 and stores the resulting vectors in an index.

In [5]:
from langchain.vectorstores import FAISS

vector_store = FAISS.from_documents(documents, embeddings)
retriever = vector_store.as_retriever()


You now have a searchable index over all of your PDFs, and a retriever that returns the documents most similar to a query.
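To make "similarity search" concrete, here is a brute-force sketch of what a vector index does conceptually: score every stored vector against the query vector and keep the best matches. It uses cosine similarity and toy 3-dimensional vectors as assumptions for illustration. FAISS uses optimized index structures, and the distance metric it applies depends on how the index is configured, so treat this as an explanation of the idea rather than of FAISS internals. The names cosine_similarity and top_k are hypothetical helpers.

```python
import math
from typing import List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec: List[float], doc_vecs: List[List[float]], k: int = 2) -> List[Tuple[int, float]]:
    """Return (document index, score) pairs for the k most similar vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings"; real sentence embeddings have hundreds of dimensions
doc_vecs = [[1.0, 0.0, 0.0], [0.7, 0.7, 0.0], [0.0, 0.0, 1.0]]
query_vec = [0.9, 0.1, 0.0]

results = top_k(query_vec, doc_vecs, k=2)
for index, score in results:
    print(f"document {index}: similarity {score:.3f}")
```

A brute-force scan like this costs one comparison per stored vector on every query, which is the work a dedicated index is designed to speed up.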

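Step 5: Load a Local LLM (Flan-T5)

The wrapper below loads a sequence-to-sequence model such as Flan-T5. A decoder-only model such as GPT-2 would need AutoModelForCausalLM and the "text-generation" pipeline instead.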

In [6]:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM, pipeline
from langchain.llms.base import LLM
from typing import Optional, List

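class LocalLLM(LLM):
    """Minimal LangChain wrapper around a local Hugging Face seq2seq model."""
    model_name: str = "google/flan-t5-base"
    pipeline: Optional[object] = None

    def __init__(self, model_name: str = "google/flan-t5-base"):
        super().__init__()
        # Download (or load from cache) the tokenizer and model weights
        tokenizer = AutoTokenizer.from_pretrained(model_name)
        model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
        self.pipeline = pipeline("text2text-generation", model=model, tokenizer=tokenizer)
        self.model_name = model_name

    def _call(self, prompt: str, stop: Optional[List[str]] = None, **kwargs) -> str:
        # truncation=True cuts prompts that exceed the tokenizer's maximum
        # input length instead of passing them through unchanged.
        # Note: the stop sequences are accepted but not applied here.
        output = self.pipeline(prompt, max_new_tokens=300, truncation=True)
        return output[0]['generated_text']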

    @property
    def _llm_type(self) -> str:
        return "custom-local-llm"


Step 6: Initialize the LLM

In [7]:
llm = LocalLLM("google/flan-t5-base")  # or flan-t5-small if CPU is used


Device set to use cpu


Step 7: Create the RetrievalQA Chain

In [8]:
from langchain.chains import RetrievalQA

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="map_reduce",  # instead of "stuff"
    retriever=retriever
)
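The chain_type argument controls how the retrieved documents reach the model. As a rough sketch of the difference: "stuff" places all retrieved documents into a single prompt, while "map_reduce" queries the model once per document and then once more to combine the partial answers. The prompt wording, the function names, and fake_llm below are all hypothetical stand-ins for illustration; they are not LangChain's actual prompts or internals.

```python
from typing import Callable, List

calls: List[str] = []  # records every prompt sent to the stand-in model


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model: logs the prompt and returns a dummy answer."""
    calls.append(prompt)
    return f"answer-{len(calls)}"


def answer_stuff(llm: Callable[[str], str], question: str, docs: List[str]) -> str:
    # One call: every document is concatenated into the same prompt
    context = "\n\n".join(docs)
    return llm(f"Context:\n{context}\n\nQuestion: {question}")


def answer_map_reduce(llm: Callable[[str], str], question: str, docs: List[str]) -> str:
    # Map step: one call per document
    partial = [llm(f"Context:\n{d}\n\nQuestion: {question}") for d in docs]
    # Reduce step: one final call over the partial answers
    summary = "\n".join(partial)
    return llm(f"Partial answers:\n{summary}\n\nQuestion: {question}")


docs = ["chunk one", "chunk two", "chunk three"]

answer_stuff(fake_llm, "What is new?", docs)
stuff_calls = len(calls)

calls.clear()
answer_map_reduce(fake_llm, "What is new?", docs)
map_reduce_calls = len(calls)

print("stuff:", stuff_calls, "call(s); map_reduce:", map_reduce_calls, "call(s)")
```

With map_reduce each individual prompt is shorter, but the number of model calls grows with the number of retrieved documents, and a single oversized document still produces an oversized prompt.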


Step 8: Ask a Question

In [9]:
# Prompt for a question
query = input("Ask your question: ")
# Get the answer from the QA chain (invoke replaces the deprecated run)
response = qa_chain.invoke({"query": query})["result"]
# Print the Q&A
print("Q:", query)
print("A:", response)
# Append the exchange to a log file
with open("qa_log.txt", "a", encoding="utf-8") as file:
    file.write(f"Q: {query}\nA: {response}\n\n")

Token indices sequence length is longer than the specified maximum sequence length for this model (5448 > 512). Running this sequence through the model will result in indexing errors
Token indices sequence length is longer than the specified maximum sequence length for this model (1493 > 1024). Running this sequence through the model will result in indexing errors


Q: advancements
A: PLoS One 7, e45798. A. Anasagasti et al. / Experimental Eye Research 116 (2013) 386e394394
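The warnings in the output above report inputs of 5448 and 1493 tokens against limits of 512 and 1024, so the model saw far more text than it is built to handle, which is a plausible reason the answer is a citation fragment rather than a real answer. One way to guard against this is to keep only as many retrieved chunks as fit a token budget. The sketch below assumes a whitespace word count as a crude stand-in for a real tokenizer, and the names rough_token_count and fit_to_budget are hypothetical; an actual setup would count tokens with the model's own tokenizer.

```python
from typing import Callable, List


def rough_token_count(text: str) -> int:
    """Crude approximation: one token per whitespace-separated word."""
    return len(text.split())


def fit_to_budget(
    chunks: List[str],
    budget: int,
    count_tokens: Callable[[str], int] = rough_token_count,
) -> List[str]:
    """Keep chunks in ranked order until the next one would exceed the budget."""
    kept: List[str] = []
    used = 0
    for chunk in chunks:
        cost = count_tokens(chunk)
        if used + cost > budget:
            break
        kept.append(chunk)
        used += cost
    return kept


# Three hypothetical retrieved chunks of 200 words each, best match first
chunks = ["word " * 200, "word " * 200, "word " * 200]
kept = fit_to_budget(chunks, budget=450)
print("kept", len(kept), "of", len(chunks), "chunks")
```

The budget should leave room for the question and the prompt template as well as the retrieved text.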
