<a href="https://colab.research.google.com/github/itsganeshhere/Chat-with-PDF-Using-RAG-Pipeline/blob/main/SITHAFAL_TASK1.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [52]:
# Install the packages the cells below actually import
!pip install PyPDF2 sentence-transformers transformers faiss-cpu



# Importing Libraries

In [53]:
import os
from PyPDF2 import PdfReader
from sentence_transformers import SentenceTransformer
import faiss
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# 1. PDF Data Ingestion

In [54]:
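def extract_text_from_pdf(pdf_path):
    """Return the concatenated text of every page in the PDF."""
    reader = PdfReader(pdf_path)
    text = ""
    for page in reader.pages:
        # Guard against pages that yield no text (e.g. image-only pages)
        text += (page.extract_text() or "") + "\n"
    return text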

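def split_text_into_chunks(text, chunk_size=500, overlap=50):
    """Split text into fixed-size character chunks that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i : i + chunk_size])
    return chunks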
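
To see how `chunk_size` and `overlap` interact, the sketch below re-declares the splitter and runs it on a tiny string. The sample text and the small sizes are illustrative assumptions, not values from the notebook. Each chunk starts `chunk_size - overlap` characters after the previous one, so neighbouring chunks share `overlap` characters.

```python
def split_text_into_chunks(text, chunk_size=500, overlap=50):
    # Same sliding-window logic as the cell above, repeated so this runs standalone.
    step = chunk_size - overlap
    return [text[i : i + chunk_size] for i in range(0, len(text), step)]


sample = "abcdefghijklmnopqrstuvwxyz"  # illustrative input
demo_chunks = split_text_into_chunks(sample, chunk_size=10, overlap=3)
for chunk in demo_chunks:
    print(chunk)
```

With these values the window advances 7 characters at a time, and the last chunk is shorter because the text runs out.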

# 2. Vector Database and Embedding

In [55]:
def create_vector_store(chunks, model_name):
    model = SentenceTransformer(model_name)
    embeddings = model.encode(chunks)

    # Initialize FAISS index
    dimension = embeddings.shape[1]
    index = faiss.IndexFlatL2(dimension)
    index.add(embeddings)
    return index, embeddings, chunks

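def query_vector_store(index, chunks, query, model_name, top_k=5):
    model = SentenceTransformer(model_name)
    query_vector = model.encode([query])
    # Never request more neighbours than the index holds
    top_k = min(top_k, index.ntotal)
    distances, indices = index.search(query_vector, top_k)
    # FAISS pads missing results with -1; skip those
    return [chunks[i] for i in indices[0] if i != -1]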
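
`faiss.IndexFlatL2` performs an exact, brute-force search and reports squared Euclidean distances. The pure-Python sketch below mimics that behaviour on hand-made 2-D vectors, so the retrieval step can be followed without downloading a model. `l2_search`, the toy chunks, and the toy vectors are hypothetical names and values used only for illustration; they are not part of FAISS or of this notebook.

```python
def l2_search(vectors, query, top_k=2):
    """Return (distances, indices) of the top_k vectors closest to query."""
    scored = []
    for idx, vec in enumerate(vectors):
        # Squared Euclidean distance, the metric IndexFlatL2 reports.
        dist = sum((a - b) ** 2 for a, b in zip(vec, query))
        scored.append((dist, idx))
    scored.sort()  # smallest distance first
    top = scored[:top_k]
    return [d for d, _ in top], [i for _, i in top]


toy_chunks = ["budget pie chart", "housing costs", "weather report"]
toy_vectors = [[1.0, 0.0], [0.8, 0.6], [-1.0, 0.0]]  # made-up embeddings
distances, indices = l2_search(toy_vectors, query=[1.0, 0.1], top_k=2)
print([toy_chunks[i] for i in indices])
```

As in `query_vector_store`, the returned indices are used to look up the original chunk text.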

# 3. LLM for Response Generation

In [56]:
def generate_response(retrieved_chunks, query, model_name="google/flan-t5-large"):
    context = "\n".join(retrieved_chunks)
    prompt = f"Answer the question based on the following context:\n\n{context}\n\nQuestion: {query}"

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

    inputs = tokenizer(prompt, return_tensors="pt", max_length=1024, truncation=True)
    # Pass the attention mask along with input_ids; temperature only takes
    # effect when sampling is enabled, so do_sample=True is set explicitly.
    outputs = model.generate(
        **inputs, max_length=300, num_return_sequences=1, do_sample=True, temperature=0.3
    )
    response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return response
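
The question sits at the end of the prompt, and Hugging Face tokenizers truncate from the right by default, so an over-long context could cut the question off. One possible guard, sketched here, is to cap the context before building the prompt. `build_prompt` and the `max_context_chars` character budget are hypothetical and not part of the notebook; a token-based budget would be more precise.

```python
def build_prompt(retrieved_chunks, query, max_context_chars=2000):
    """Join chunks into a context of at most max_context_chars characters."""
    kept, used = [], 0
    for chunk in retrieved_chunks:
        remaining = max_context_chars - used
        if remaining <= 0:
            break
        piece = chunk[:remaining]  # trim the chunk that crosses the budget
        kept.append(piece)
        used += len(piece) + 1  # +1 for the newline used to join chunks
    context = "\n".join(kept)
    return (
        "Answer the question based on the following context:\n\n"
        f"{context}\n\nQuestion: {query}"
    )


prompt = build_prompt(["a" * 30, "b" * 30], "What is shown?", max_context_chars=40)
print(prompt)
```

The prompt template is the same one used in `generate_response`; only the context length is bounded.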

# Example Workflow

In [57]:

pdf_path = input("Enter the PDF path: ")
print(pdf_path)
embedding_model = "all-MiniLM-L6-v2"

Enter the PDF path: /content/SITHAFAL_DOCUMENT.pdf
/content/SITHAFAL_DOCUMENT.pdf


### Step 1: Extract text from PDF and split into chunks

In [58]:
print("Extracting text from PDF")
text = extract_text_from_pdf(pdf_path)
chunks = split_text_into_chunks(text)

Extracting text from PDF


### Step 2: Create vector store

In [59]:
print("Creating vector store...")
vector_store, embeddings, chunks = create_vector_store(chunks, embedding_model)
print(vector_store)

Creating vector store...
<faiss.swigfaiss_avx2.IndexFlatL2; proxy of <Swig Object of type 'faiss::IndexFlatL2 *' at 0x7ac5bce88c60> >


### Step 3: Handle user query

In [60]:
query = input("Enter the Query Related to pdf:")
print("Retrieving relevant chunks...")
retrieved_chunks = query_vector_store(vector_store, chunks, query, embedding_model)


Enter the Query Related to pdf:Example from Everyday Life?
Retrieving relevant chunks...


### Step 4: Generate response

In [61]:
print("Generating response")
response = generate_response(retrieved_chunks, query)
print("Response:", response)

Generating response
Response: 19% 10% 15% 5%26%25%Family Budget of $31,000 Other Recreation Transportation Clothing housing Food
