
# Task
Build a Retrieval-Augmented Generation (RAG) system that answers questions from company policy documents. The task allows PDF/TXT/Markdown sources; this notebook works with DOCX files. The pipeline covers document loading, cleaning, chunking, embedding, vector database storage, semantic retrieval, prompt engineering, and evaluation.

**Mount Google Drive**

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


**Create Project Folder Structure**

In [2]:
import os

base_path = "/content/drive/MyDrive/rag_project"
data_path = os.path.join(base_path, "data")
notebook_path = os.path.join(base_path, "notebook")

os.makedirs(data_path, exist_ok=True)
os.makedirs(notebook_path, exist_ok=True)

print("Created folders:")
print(data_path)
print(notebook_path)

Created folders:
/content/drive/MyDrive/rag_project/data
/content/drive/MyDrive/rag_project/notebook


**Verify Uploaded Policy Documents**

In [3]:
import os

data_path = "/content/drive/MyDrive/rag_project/data"
print(os.listdir(data_path))


['Refund_Policy.docx', 'Shipping_Policy.docx', 'Cancellation_Policy.docx']


**Install DOCX Loader Dependency**

In [4]:
%pip install docx2txt




In [5]:
%pip install python-docx




In [6]:
from docx import Document
import os

data_folder = "/content/drive/MyDrive/rag_project/data"
documents = []

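# Read every .docx file in a stable (sorted) order and keep its paragraph text.
for file in sorted(os.listdir(data_folder)):
    if file.endswith(".docx"):
        doc = Document(os.path.join(data_folder, file))
        # doc.paragraphs covers body paragraphs only; text inside tables is not included.
        documents.append("\n".join(p.text for p in doc.paragraphs))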

print(f"Loaded {len(documents)} documents automatically.")


Loaded 3 documents automatically.
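
The task lists cleaning as a pipeline step, but the loaded text goes to the splitter as-is. A minimal sketch of a whitespace-normalizing pass is shown below. The function name `clean_text` and the sample string are hypothetical and not part of the original notebook; real policy files may need different rules.

```python
import re


def clean_text(raw: str) -> str:
    """Normalize whitespace in text extracted from a policy document."""
    # Replace non-breaking spaces, which Word documents often contain.
    text = raw.replace("\u00a0", " ")
    # Collapse runs of spaces and tabs inside each line and trim both ends.
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines()]
    # Drop empty lines so paragraphs are separated by a single newline.
    return "\n".join(line for line in lines if line)


# Hypothetical sample text, used only to show the effect of the cleaning pass.
sample = "Refund  Policy\u00a0\n\n\n  Refunds are\tprocessed after approval.  \n"
cleaned = clean_text(sample)
print(cleaned)
```

In this notebook the pass could be applied as `documents = [clean_text(d) for d in documents]` before the conversion to LangChain documents.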


**Install LangChain and Community Packages**

In [7]:
%pip install -U langchain langchain-community docx2txt




**Import Document Loader**

In [8]:
from langchain_community.document_loaders import Docx2txtLoader



**Convert Loaded Text to LangChain Documents**

In [9]:
from langchain_core.documents import Document as LCDocument

docs = [LCDocument(page_content=text) for text in documents]

print(f"Converted {len(docs)} documents into LangChain format.")


Converted 3 documents into LangChain format.


**Import Text Splitter**

In [10]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


**Chunk Documents into Smaller Pieces**

In [11]:
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50
)

chunks = splitter.split_documents(docs)
print("Chunks created:", len(chunks))


Chunks created: 3


## Design Trade-offs

### Vector Database Choice
ChromaDB was selected for simplicity and fast prototyping. FAISS could offer better performance for very large datasets, but ChromaDB provides easier integration and persistence.

### Chunk Size Choice
A chunk size of 500 characters with a 50-character overlap balances semantic completeness against retrieval precision. Smaller chunks make each match more specific but increase the number of embeddings to compute and store, while larger chunks may mix multiple topics in a single vector.
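
To make the trade-off concrete, the sketch below counts chunks for different sizes using a plain fixed-window splitter. This is a simplification, not how `RecursiveCharacterTextSplitter` works internally (that splitter prefers paragraph and sentence boundaries); the helper `chunk_text` and the 1200-character dummy text are assumptions for illustration.

```python
def chunk_text(text: str, chunk_size: int = 500, chunk_overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


# Dummy 1200-character text: smaller windows produce more chunks to embed.
text = "x" * 1200
for size in (200, 500, 1000):
    print(size, len(chunk_text(text, chunk_size=size, chunk_overlap=50)))
# prints: 200 8, 500 3, 1000 2
```

A text shorter than `chunk_size` yields exactly one chunk, which is consistent with the notebook reporting 3 chunks for 3 short policy documents.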

### Embedding Model
The lightweight `sentence-transformers/all-MiniLM-L6-v2` model was used to reduce memory usage and inference cost.

### LLM Choice
`google/flan-t5-base` was selected as an open-source model to avoid API costs. Larger models may improve answer quality but require more compute.

### Notebook Design
A single notebook simplifies demonstration, while modular Python files would improve maintainability for production systems.


**Initialize Embedding Model**

In [12]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


**Install Chroma Vector Database**

In [13]:
%pip install chromadb




**Create Vector Store and Retriever**

In [14]:
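def build_vector_store(chunks, embeddings):
    from langchain_community.vectorstores import Chroma
    # Embed the chunks and store them in a Chroma collection persisted on disk.
    db = Chroma.from_documents(
        documents=chunks,
        embedding=embeddings,
        persist_directory="./chroma_db"
    )
    # Return a retriever that fetches the 2 most similar chunks per query.
    return db.as_retriever(search_kwargs={"k": 2})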
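
Conceptually, the retriever embeds the question and returns the `k` stored chunks whose vectors are closest to it. The sketch below shows that idea with cosine similarity on toy 3-dimensional vectors. The vectors, labels, and helper names are hypothetical, and the metric Chroma actually applies depends on how the collection is configured, so treat this as an illustration rather than a description of Chroma's internals.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], indexed: list[tuple[str, list[float]]], k: int = 2):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: made-up 3-dimensional vectors standing in for real embeddings.
index = [
    ("refund chunk", [0.9, 0.1, 0.0]),
    ("shipping chunk", [0.1, 0.9, 0.0]),
    ("cancellation chunk", [0.6, 0.0, 0.8]),
]
query = [1.0, 0.0, 0.0]
results = top_k(query, index, k=2)
for score, text in results:
    print(f"{score:.3f}  {text}")
```

With `k=2` the two best-scoring chunks are returned even when neither is truly relevant, which matters for out-of-scope questions.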


**File Verification**

In [15]:
def verify_files():
    import os
    files = os.listdir("/content/drive/MyDrive/rag_project/data")
    print("Found files:", files)

verify_files()


Found files: ['Refund_Policy.docx', 'Shipping_Policy.docx', 'Cancellation_Policy.docx']


**Logging**

In [16]:
import datetime

def log_query(question):
    t = datetime.datetime.now().strftime("%H:%M:%S")
    print(f"[{t}] Question:", question)


In [17]:
def retrieve_context(question):
    docs = retriever.invoke(question)
    return "\n".join([d.page_content for d in docs])


In [18]:
def ask(question):
    log_query(question)
    context = retrieve_context(question)

    # Skip the LLM call when retrieval returns no text at all.
    if not context.strip():
        return "Not found in the provided documents."

    return generate_answer(context, question)
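
Because the retriever always returns its top `k` chunks, the empty-context check is unlikely to trigger once the store holds documents, so refusals for out-of-scope questions depend mostly on the model following the prompt. One possible refinement is to drop weakly matching chunks before generation, sketched below. The threshold `0.5`, the scores, and the helper names are assumptions; in LangChain the scored pairs could come from a vector store method such as `similarity_search_with_relevance_scores`, which is not used in the original notebook.

```python
NOT_FOUND = "Not found in the provided documents."


def select_context(scored_chunks, min_score=0.5):
    """Join the text of chunks whose relevance score reaches min_score."""
    kept = [text for text, score in scored_chunks if score >= min_score]
    return "\n".join(kept)


def answer_or_refuse(scored_chunks, generate, min_score=0.5):
    """Call generate(context) only when at least one chunk is relevant enough."""
    context = select_context(scored_chunks, min_score)
    if not context.strip():
        return NOT_FOUND
    return generate(context)


# Stand-in for the LLM call so the sketch runs without a model.
def fake_generate(context):
    return f"(answer based on: {context})"


# Hypothetical relevance scores for an in-scope and an out-of-scope question.
in_scope = [("Digital products are not eligible for refunds.", 0.82),
            ("We currently ship to USA, Canada, and India.", 0.21)]
out_of_scope = [("Digital products are not eligible for refunds.", 0.12),
                ("We currently ship to USA, Canada, and India.", 0.09)]

print(answer_or_refuse(in_scope, fake_generate))
print(answer_or_refuse(out_of_scope, fake_generate))
```

A suitable threshold depends on the embedding model and distance metric, so it would need tuning against known in-scope and out-of-scope questions.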


**Define Final Prompt Template**

In [19]:
PROMPT = """
You are an AI assistant that answers strictly from company policy documents.

Rules:
1. Use ONLY the information in <context>.
2. If the answer is not in <context>, say:
   Not found in the provided documents.
3. Do NOT guess or use outside knowledge.
4. Cite evidence.

<context>
{context}
</context>

Question:
{question}

Answer Format:
- Answer:
- Evidence:
"""


**Install Transformers for the Open-Source LLM**

In [20]:
%pip install transformers accelerate




**Load Open-Source Language Model (FLAN-T5)**

In [21]:
from transformers import pipeline

llm = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    max_new_tokens=256
)


Device set to use cpu


**Define Answer Generation Function**

In [22]:
def generate_answer(context, question):
    prompt = PROMPT.format(context=context, question=question)
    output = llm(prompt)[0]["generated_text"]
    return output


In [23]:
def run_evaluation(questions):
    results = []
    for q in questions:
        ans = ask(q)
        results.append({"question": q, "answer": ans})
    return results


**Build the Retriever**

In [24]:
retriever = build_vector_store(chunks, embeddings)
print("Retriever created")


Retriever created


**Test RAG System with Sample Questions**

In [25]:
print(ask("What is the refund period?"))
print(ask("How long does international shipping take?"))
print(ask("Who is the CEO?"))


[11:50:26] Question: What is the refund period?
7 business days after approval
[11:50:31] Question: How long does international shipping take?
10-15 business days.
[11:50:35] Question: Who is the CEO?
Not found in the provided documents


**Run Evaluation on Multiple Questions**

In [26]:
evaluation_questions = [
    "What is the refund period?",
    "How long does international shipping take?",
    "Is cancellation allowed after shipping?",
    "Are digital products refundable?",
    "Do you ship to Germany?",
    "What payment gateway is used?"
]

evaluation_results = run_evaluation(evaluation_questions)


for r in evaluation_results:
    print("\nQ:", r["question"])
    print("A:", r["answer"])



[11:50:40] Question: What is the refund period?
[11:50:45] Question: How long does international shipping take?
[11:50:49] Question: Is cancellation allowed after shipping?
[11:51:01] Question: Are digital products refundable?
[11:51:05] Question: Do you ship to Germany?
[11:51:17] Question: What payment gateway is used?

Q: What is the refund period?
A: 7 business days after approval

Q: How long does international shipping take?
A: 10-15 business days.

Q: Is cancellation allowed after shipping?
A: Orders can be cancelled within 12 hours of placing the order. Once shipped, orders cannot be cancelled. Refunds for cancelled orders are processed within 5 business days.

Q: Are digital products refundable?
A: Digital products are not eligible for refunds.

Q: Do you ship to Germany?
A: We currently ship to USA, Canada, and India.

Q: What payment gateway is used?
A: Not found in the provided documents
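
The run above prints answers but does not score them. A minimal sketch of a keyword-based check follows. The answers are copied from the output above, while the expected keywords are hypothetical labels that would need to be verified against the policy files; `keyword_hit` and `score_results` are illustrative names, not part of the original notebook.

```python
def keyword_hit(answer: str, expected_keywords: list[str]) -> bool:
    """Return True when every expected keyword appears in the answer (case-insensitive)."""
    lowered = answer.lower()
    return all(keyword.lower() in lowered for keyword in expected_keywords)


def score_results(results: list[dict], expectations: dict[str, list[str]]) -> float:
    """Fraction of labelled questions whose answer contains all expected keywords."""
    labelled = [r for r in results if r["question"] in expectations]
    if not labelled:
        return 0.0
    hits = sum(keyword_hit(r["answer"], expectations[r["question"]]) for r in labelled)
    return hits / len(labelled)


# Answers copied from the run above; in the notebook this would be `evaluation_results`.
sample_results = [
    {"question": "What is the refund period?",
     "answer": "7 business days after approval"},
    {"question": "Are digital products refundable?",
     "answer": "Digital products are not eligible for refunds."},
    {"question": "What payment gateway is used?",
     "answer": "Not found in the provided documents"},
]

# Hypothetical labels: replace with keywords checked against the policy files.
expected = {
    "What is the refund period?": ["7 business days"],
    "Are digital products refundable?": ["not eligible"],
    "What payment gateway is used?": ["not found"],
}

accuracy = score_results(sample_results, expected)
print(f"Keyword accuracy: {accuracy:.2f}")
```

Keyword matching is a coarse signal: it cannot tell a correct answer from one that merely repeats the right phrase, so a manual review of the evidence is still useful.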


**Sample Test Cases**

In [29]:
evaluation_questions = [
    # Reasoning-Based
    "Can I cancel my order after it is shipped?",
    "Are shipping fees refundable?",
    "How long does refund processing take after approval?",

    # Edge / Out-of-Scope
    "Who is the CEO of the company?",
    "What is the company address?",

    # Policy Combination / Tricky
    "Can I cancel within 1 hour of placing order?",
    "Are digital items eligible for refund?",
    "Which countries do you ship to?"
]

evaluation_results = run_evaluation(evaluation_questions)

for r in evaluation_results:
    print("\nQ:", r["question"])
    print("A:", r["answer"])


[12:06:19] Question: Can I cancel my order after it is shipped?
[12:06:25] Question: Are shipping fees refundable?
[12:06:28] Question: How long does refund processing take after approval?
[12:06:30] Question: Who is the CEO of the company?
[12:06:33] Question: What is the company address?
[12:06:36] Question: Can I cancel within 1 hour of placing order?
[12:06:37] Question: Are digital items eligible for refund?
[12:06:40] Question: Which countries do you ship to?

Q: Can I cancel my order after it is shipped?
A: Orders can be cancelled within 12 hours of placing the order. Once shipped, orders cannot be cancelled. Refunds for cancelled orders are processed within 5 business days.

Q: Are shipping fees refundable?
A: Non-refundable.

Q: How long does refund processing take after approval?
A: 7 business days

Q: Who is the CEO of the company?
A: Not found in the provided documents

Q: What is the company address?
A: Not found in the provided documents

Q: Can I cancel within 1 hour of 