
# Task
Build a Retrieval-Augmented Generation (RAG) system that answers questions from company policy documents (PDF/TXT/Markdown). The pipeline covers document loading, cleaning, chunking, embedding, vector database storage, semantic retrieval, prompt engineering, and evaluation.

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
import os

base_path = "/content/drive/MyDrive/rag_project"
data_path = os.path.join(base_path, "data")
notebook_path = os.path.join(base_path, "notebook")

os.makedirs(data_path, exist_ok=True)
os.makedirs(notebook_path, exist_ok=True)

print("Created folders:")
print(data_path)
print(notebook_path)

Created folders:
/content/drive/MyDrive/rag_project/data
/content/drive/MyDrive/rag_project/notebook


In [3]:
import os

data_path = "/content/drive/MyDrive/rag_project/data"
print(os.listdir(data_path))


['Refund_Policy.docx', 'Cancellation_Policy.docx', 'Shipping_Policy.docx']


In [4]:
%pip install docx2txt




In [5]:
%pip install -U langchain langchain-community langchain-text-splitters docx2txt




In [6]:
from langchain_community.document_loaders import Docx2txtLoader



In [7]:
# Load each policy file from the data folder into one list of documents.
policy_files = ["Refund_Policy.docx", "Cancellation_Policy.docx", "Shipping_Policy.docx"]

docs = []
for name in policy_files:
    docs += Docx2txtLoader(os.path.join(data_path, name)).load()

print("Documents loaded:", len(docs))


Documents loaded: 3
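
The task statement lists a cleaning step, but no cleaning cell appears in this notebook. The following is a minimal sketch of what such a step could look like, assuming that whitespace and Unicode normalization is all that is needed. The helper name `clean_text` and the sample string are hypothetical and not part of the original notebook.

```python
import re
import unicodedata


def clean_text(text: str) -> str:
    """Normalize raw extracted text before chunking (hypothetical helper)."""
    # NFKC folds compatibility characters, e.g. a non-breaking space becomes a space.
    text = unicodedata.normalize("NFKC", text)
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Remove spaces that surround line breaks.
    text = re.sub(r" *\n *", "\n", text)
    # Collapse three or more newlines into a single blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


# Made-up sample input, for illustration only.
raw = "Refund  Policy\u00a0\n\n\n\nCustomers may request a refund.\t \n"
print(clean_text(raw))
```

In the notebook, this could be applied to each loaded document by reassigning its `page_content` before chunking.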


In [8]:
from langchain_text_splitters import RecursiveCharacterTextSplitter


In [9]:
# chunk_size and chunk_overlap are measured in characters by default.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
)

chunks = splitter.split_documents(docs)
print("Chunks created:", len(chunks))


Chunks created: 3
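
To illustrate how `chunk_size` and `chunk_overlap` interact, here is a simplified fixed-window chunker. It is not the algorithm used by `RecursiveCharacterTextSplitter`, which prefers to split at separators such as paragraph and line breaks. The function name `sliding_window_chunks` and the sample text are hypothetical.

```python
def sliding_window_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list:
    """Split text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks


sample = "abcdefghij" * 3  # 30 characters
pieces = sliding_window_chunks(sample, chunk_size=12, chunk_overlap=2)
print([len(p) for p in pieces])
```

A text shorter than `chunk_size` yields a single chunk. That may be why three documents produced exactly three chunks above, assuming each policy file is under 500 characters.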


In [10]:
from langchain_community.embeddings import HuggingFaceEmbeddings

embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)


The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.


In [11]:
%pip install chromadb




In [12]:
from langchain_community.vectorstores import Chroma

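# Embed every chunk and store the vectors in an in-memory Chroma collection.
db = Chroma.from_documents(chunks, embeddings)
# Request up to 4 chunks per query; with only 3 chunks indexed, at most 3 come back.
retriever = db.as_retriever(search_kwargs={"k": 4})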
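
Conceptually, semantic retrieval ranks stored chunk vectors by their similarity to the query vector and keeps the top `k`. The sketch below shows that idea with cosine similarity on toy three-dimensional vectors. Chroma's actual index and distance metric may differ, and the names `cosine_similarity`, `top_k`, and `toy_index` are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k):
    """Return the texts of the k entries most similar to query_vec."""
    ranked = sorted(
        indexed,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:k]]


# Toy vectors for illustration; real embeddings have hundreds of dimensions.
toy_index = [
    ("refund chunk", [0.9, 0.1, 0.0]),
    ("shipping chunk", [0.1, 0.9, 0.0]),
    ("cancellation chunk", [0.0, 0.2, 0.9]),
]
print(top_k([1.0, 0.0, 0.0], toy_index, k=2))
```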


In [13]:
import os
from google.colab import userdata

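# Copy the key from Colab secrets into the environment variable that client libraries read.
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
# Report only whether a non-empty key was loaded; never print the key itself.
print("Key Loaded:", bool(os.environ.get("OPENAI_API_KEY")))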


Key Loaded: True


In [14]:
PROMPT = """
You are an AI assistant that answers strictly from company policy documents.

Rules:
1. Use ONLY the information in <context>.
2. If the answer is not found, say:
   Not found in the provided documents.
3. Do NOT guess or use outside knowledge.
4. Cite evidence.

<context>
{context}
</context>

Question:
{question}

Answer Format:
- Answer:
- Evidence:
"""

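Rule 4 of the prompt asks the model to cite evidence, which is easier when each retrieved chunk in `{context}` carries a visible label. The sketch below shows one possible way to build such a context string. It assumes each chunk is paired with the path of the file it came from. The helper name `format_context` and the bracketed label format are hypothetical, not something the notebook defines.

```python
import os


def format_context(chunks):
    """Join (text, source_path) pairs into a numbered, source-labeled context."""
    blocks = []
    for i, (text, source) in enumerate(chunks, start=1):
        label = os.path.basename(source)  # keep only the file name
        blocks.append(f"[{i}] ({label})\n{text.strip()}")
    return "\n\n".join(blocks)


# Sentences taken from the policy text; pairing them with these paths is illustrative.
example = format_context([
    ("Customers may request a refund within 30 days of purchase.",
     "/content/drive/MyDrive/rag_project/data/Refund_Policy.docx"),
    ("International shipping takes 10-15 business days.",
     "/content/drive/MyDrive/rag_project/data/Shipping_Policy.docx"),
])
print(example)
```

With labels like these in the context, the "Evidence" line of the answer format can point to a specific file instead of repeating the whole passage.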

In [23]:
%pip install transformers accelerate




In [25]:
from transformers import pipeline

llm = pipeline(
    "text2text-generation",
    model="google/flan-t5-base",
    max_new_tokens=256
)


Device set to use cpu


In [26]:
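def generate_answer(context, question):
    """Fill the prompt template and return the model's generated text."""
    prompt = PROMPT.format(context=context, question=question)
    # The pipeline returns a list with one dict per input; take the first.
    return llm(prompt)[0]["generated_text"]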
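
The cells that follow call `ask(...)`, but its definition does not appear in this export. The sketch below shows what such a function might look like, assuming it retrieves chunks for the question and passes their text to `generate_answer`. It is not the author's implementation. `make_ask`, `FakeRetriever`, and `fake_generate` are hypothetical names, and the stand-ins exist only so the example runs without the notebook state.

```python
from types import SimpleNamespace


def make_ask(retriever, generate_answer):
    """Build an ask(question) function from a retriever and an answer generator."""
    def ask(question: str) -> str:
        # Fetch the chunks most relevant to the question.
        retrieved = retriever.invoke(question)
        # Join the chunk texts into one context string for the prompt.
        context = "\n\n".join(doc.page_content for doc in retrieved)
        return generate_answer(context, question)
    return ask


class FakeRetriever:
    """Stand-in for a LangChain retriever; always returns one fixed chunk."""
    def invoke(self, query):
        return [SimpleNamespace(
            page_content="International shipping takes 10-15 business days.")]


def fake_generate(context, question):
    """Stand-in for the model call; echoes its inputs."""
    return f"{question} -> {context}"


ask = make_ask(FakeRetriever(), fake_generate)
print(ask("How long does international shipping take?"))
```

In the notebook itself, the equivalent wiring would be `ask = make_ask(retriever, generate_answer)`, using the objects defined in the earlier cells.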


In [27]:
print(ask("What is the refund period?"))
print(ask("How long does international shipping take?"))
print(ask("Who is the CEO?"))


Refund Policy Customers may request a refund within 30 days of purchase. The product must be unused and in original packaging. Refunds are processed within 7 business days after approval. Shipping fees are non-refundable. Digital products are not eligible for refunds. Refunds for cancelled orders are processed within 5 business days. Shipping Policy Orders are processed within 2 business days. Standard shipping takes 5-7 business days. International shipping takes 10-15 business days.
10-15 business days
Not found in the provided documents
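
Spot checks like the three above can be turned into a small automated check. The sketch below is one simple option, assuming that keyword matching plus refusal detection is an acceptable proxy for correctness. The function name `score_answer` and the expected keywords are assumptions chosen to match the outputs printed above.

```python
from typing import List, Optional

REFUSAL = "not found in the provided documents"


def score_answer(answer: str, expected_keywords: Optional[List[str]]) -> bool:
    """Return True if the answer passes.

    expected_keywords=None marks an out-of-scope question, where a refusal
    is the correct response.
    """
    normalized = answer.lower()
    if expected_keywords is None:
        return REFUSAL in normalized
    # An in-scope answer must not refuse and must contain every keyword.
    return REFUSAL not in normalized and all(
        keyword.lower() in normalized for keyword in expected_keywords
    )


# Answers copied from the outputs above; keywords are illustrative assumptions.
print(score_answer("10-15 business days", ["10-15"]))
print(score_answer("Not found in the provided documents", None))
```

Keyword matching does not penalize verbosity. The first answer above repeats most of the retrieved context and would still pass a check for "30 days", so a length limit or manual review may be needed alongside it.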


In [28]:
evaluation_questions = [
    "What is the refund period?",
    "How long does international shipping take?",
    "Is cancellation allowed after shipping?",
    "Are digital products refundable?",
    "Do you ship to Germany?",
    "What payment gateway is used?"
]

for q in evaluation_questions:
    print("\nQ:", q)
    print(ask(q))



Q: What is the refund period?
Refund Policy Customers may request a refund within 30 days of purchase. The product must be unused and in original packaging. Refunds are processed within 7 business days after approval. Shipping fees are non-refundable. Digital products are not eligible for refunds. Refunds for cancelled orders are processed within 5 business days. Shipping Policy Orders are processed within 2 business days. Standard shipping takes 5-7 business days. International shipping takes 10-15 business days.

Q: How long does international shipping take?
10-15 business days

Q: Is cancellation allowed after shipping?
Orders can be cancelled within 12 hours of placing the order. Once shipped, orders cannot be cancelled. Refunds for cancelled orders are processed within 5 business days. Refund Policy Customers may request a refund within 30 days of purchase. The product must be unused and in original packaging. Refunds are processed within 7 business days after approval. Shipping 