# RAG Assignment Implementation

## Problem Statement
This notebook implements a Retrieval-Augmented Generation (RAG) pipeline to answer questions based on a provided company policy document (`knowledge_base.txt`).

## 1. Setup and Imports
We use `langchain` (with the `langchain-community` integrations), `faiss-cpu`, and `sentence-transformers`.

In [1]:
import os
from typing import List
from langchain_community.document_loaders import TextLoader
try:
    from langchain.text_splitter import RecursiveCharacterTextSplitter
except ImportError:
    from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
# from langchain_openai import OpenAI
# from langchain.chains import RetrievalQA

# Optional: Set OpenAI API Key if available
# os.environ["OPENAI_API_KEY"] = "sk-..."

## 2. Data Loading
We load the text from `knowledge_base.txt`.

In [2]:
# Define the path to the knowledge base
file_path = 'knowledge_base.txt'

# Check if file exists
if not os.path.exists(file_path):
    # Create dummy data if not present (for standalone execution)
    with open(file_path, 'w') as f:
        f.write("""Remote Work Policy - Acme Corp
Effective Date: January 1, 2024

1. Purpose
The purpose of this Remote Work Policy is to outline the guidelines and expectations for employees working remotely. Acme Corp recognizes the benefits of remote work in promoting work-life balance and productivity.

2. Eligibility
Full-time employees who have completed their probationary period are eligible to apply for remote work. Roles that require physical presence (e.g., hardware maintenance, front-desk reception) are not eligible.

3. Work Hours & Availability
Remote employees must be available during core business hours (10:00 AM - 4:00 PM EST). Employees are expected to maintain the same level of productivity and responsiveness as they would in the office.

4. Equipment & Security
Acme Corp will provide a company laptop and necessary software. Employees must ensure their home Wi-Fi network is secure and password-protected. Use of public Wi-Fi for handling sensitive company data is strictly prohibited unless a VPN is used.

5. Communication
Employees should use Slack for asynchronous communication and Zoom for meetings. Weekly check-ins with managers are mandatory.

6. Expense Reimbursement
Acme Corp will reimburse up to $50/month for internet expenses. Home office furniture or electricity costs are not reimbursable.

7. Termination of Remote Work
Acme Corp reserves the right to terminate remote work agreements at any time if performance standards are not met or business needs change.""")

loader = TextLoader(file_path)
documents = loader.load()
print(f"Loaded {len(documents)} document(s).")
print(f"Content preview: {documents[0].page_content[:200]}...")

Loaded 1 document(s).
Content preview: Remote Work Policy - Acme Corp
Effective Date: January 1, 2024

1. Purpose
The purpose of this Remote Work Policy is to outline the guidelines and expectations for employees working remotely. Acme Cor...


## 3. Text Chunking Strategy
We use `RecursiveCharacterTextSplitter` with a chunk size of 500 characters and a 50-character overlap.
- **Chunk Size (500)**: Small enough to capture specific rules (e.g., "Eligibility") without mixing unrelated topics.
- **Overlap (50)**: Ensures context is preserved if a sentence is split across chunks.
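
To make the two parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a simplification: `RecursiveCharacterTextSplitter` prefers to break on the listed separators, whereas this hypothetical helper (`chunk_with_overlap`, not part of LangChain) cuts at fixed character offsets. The small sizes are chosen only so the overlap is easy to see.

```python
def chunk_with_overlap(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size character windows sharing `overlap` characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_with_overlap(sample, chunk_size=10, overlap=3)
for piece in pieces:
    print(piece)
```

Each window repeats the last three characters of the previous one, which is the same idea as the 50-character overlap used in this notebook.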

In [3]:
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", " ", ""]
)

chunks = text_splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks.")
print(f"Example chunk: {chunks[0].page_content}")

Split into 4 chunks.
Example chunk: Remote Work Policy - Acme Corp
Effective Date: January 1, 2024

1. Purpose
The purpose of this Remote Work Policy is to outline the guidelines and expectations for employees working remotely. Acme Corp recognizes the benefits of remote work in promoting work-life balance and productivity.


## 4. Embedding Details
We use `HuggingFaceEmbeddings` with the model `all-MiniLM-L6-v2`.
- **Reason**: It is a free, open-source model that provides high-quality dense vector representations (384-dimensional). It runs efficiently on CPU.
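
Dense vectors are useful because semantically related texts end up close together. A common way to measure that closeness is cosine similarity, sketched below in plain Python. The three-dimensional vectors are made-up toy values for illustration, not real output of `all-MiniLM-L6-v2` (which produces 384 dimensions).

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Toy vectors (assumed values, for illustration only)
query_vec = [1.0, 0.0, 1.0]      # stands in for "Does the company pay for internet?"
internet_vec = [0.9, 0.1, 0.8]   # stands in for the expense reimbursement chunk
hours_vec = [0.0, 1.0, 0.1]      # stands in for the work hours chunk

sim_internet = cosine_similarity(query_vec, internet_vec)
sim_hours = cosine_similarity(query_vec, hours_vec)
print(f"internet chunk: {sim_internet:.3f}, hours chunk: {sim_hours:.3f}")
```

In this toy setup the reimbursement vector scores much higher than the work hours vector, so it would be ranked first for the query.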

In [4]:
# Initialize Embedding Model
embedding_model = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


## 5. Vector Database
We use **FAISS** (Facebook AI Similarity Search) as our vector store.
- **Reason**: Optimized for fast similarity search and clustering of dense vectors. It is easy to install locally and doesn't require an external server.
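
Conceptually, the simplest form of vector search compares the query vector with every stored vector and keeps the closest `k`. The sketch below shows that exhaustive idea in plain Python using squared Euclidean distance. It is not how FAISS is implemented internally, and the names (`brute_force_search`, `toy_index`) and two-dimensional vectors are assumptions made for illustration.

```python
import heapq


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two vectors (smaller is closer)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def brute_force_search(index: dict[str, list[float]], query: list[float], k: int = 2):
    """Return the k (distance, label) pairs closest to the query."""
    scored = ((squared_l2(vec, query), label) for label, vec in index.items())
    return heapq.nsmallest(k, scored)


# Toy index: label -> made-up 2-dimensional vector
toy_index = {
    "eligibility": [0.9, 0.1],
    "expenses": [0.1, 0.9],
    "work_hours": [0.5, 0.5],
}

results = brute_force_search(toy_index, query=[0.2, 0.8], k=2)
for distance, label in results:
    print(f"{label}: {distance:.2f}")
```

A library such as FAISS provides the same kind of nearest-neighbour lookup, but with optimized native code and optional approximate indexes for large collections.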

In [5]:
# Create Vector Store
vector_store = FAISS.from_documents(chunks, embedding_model)
print("Vector store created successfully.")

Vector store created successfully.


## 6. Retrieval and Generation
We retrieve the best-matching chunks with `similarity_search` and pass them to an LLM to generate the answer.

### Mock LLM Implementation (for demonstration without API Key)
Since we may not have a live API key in this environment, we define a Mock LLM to simulate the response generation step.

In [6]:
class MockLLM:
    """Stand-in for a real LLM: retrieves context and returns a templated answer."""

    def __init__(self, vector_store):
        self.vector_store = vector_store

    def answer_question(self, query: str, k: int = 2):
        # 1. Retrieve the k chunks most similar to the query
        docs = self.vector_store.similarity_search(query, k=k)
        # The joined context is what a real LLM would receive in its prompt.
        # The mock only echoes the top chunk, so `context` is not used further.
        context = "\n".join(doc.page_content for doc in docs)

        # 2. Simulate generation (in a real scenario, `context` goes to OpenAI/Llama)
        response = f"""
        [Generated Answer based on Context]
        Based on the policy:
        - Context Found: {docs[0].page_content[:100]}...
        
        (This is a mock response. To use a real LLM, uncomment the OpenAI section below.)
        """
        return response, docs

# Real LLM integration (commented out: needs OPENAI_API_KEY and the imports commented out in Section 1)
# llm = OpenAI(temperature=0)
# qa_chain = RetrievalQA.from_chain_type(llm=llm, chain_type="stuff", retriever=vector_store.as_retriever())
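
With a real LLM, the retrieved chunks are typically concatenated into a single prompt alongside the question, which is the idea behind the `"stuff"` chain type. The sketch below shows one way such a prompt could be assembled. The helper `build_prompt` and the template wording are assumptions for illustration, not LangChain's actual template.

```python
def build_prompt(question: str, context_chunks: list[str]) -> str:
    """Combine retrieved chunks and the user question into one prompt string."""
    context = "\n\n".join(context_chunks)
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


# Example chunks copied from the policy text used in this notebook
retrieved = [
    "6. Expense Reimbursement\nAcme Corp will reimburse up to $50/month for internet expenses.",
    "4. Equipment & Security\nAcme Corp will provide a company laptop and necessary software.",
]

prompt = build_prompt("Does the company pay for internet?", retrieved)
print(prompt)
```

Instructing the model to rely only on the supplied context is a common way to reduce answers that are not grounded in the policy document.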

## 7. Testing
We test the system with 3 different queries.
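
Printed output has to be checked by eye. As a complement, the same kind of query could be turned into an automatic check that the top retrieved text contains an expected phrase. The sketch below is one possible approach: `check_retrieval` and `fake_retrieve` are hypothetical helpers, and the canned retriever stands in for the real vector store so the example runs on its own.

```python
from typing import Callable


def check_retrieval(
    retrieve: Callable[[str], str], expectations: dict[str, str]
) -> dict[str, bool]:
    """Map each query to True if its retrieved text contains the expected phrase."""
    return {
        query: phrase.lower() in retrieve(query).lower()
        for query, phrase in expectations.items()
    }


def fake_retrieve(query: str) -> str:
    """Stand-in retriever returning canned policy text (assumption for the demo)."""
    canned = {
        "Does the company pay for internet?":
            "Acme Corp will reimburse up to $50/month for internet expenses.",
        "What are the core work hours?":
            "Remote employees must be available during core business hours "
            "(10:00 AM - 4:00 PM EST).",
    }
    return canned.get(query, "")


expectations = {
    "Does the company pay for internet?": "$50/month",
    "What are the core work hours?": "10:00 AM - 4:00 PM",
}

report = check_retrieval(fake_retrieve, expectations)
for query, passed in report.items():
    print(f"{'PASS' if passed else 'FAIL'}: {query}")
```

To run the check against the real index, `fake_retrieve` could be replaced with something like `lambda q: vector_store.similarity_search(q, k=1)[0].page_content`.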

In [7]:
rag_system = MockLLM(vector_store)

test_queries = [
    "What is the eligibility for remote work?",
    "Does the company pay for internet?",
    "What are the core work hours?"
]

print("--- Test Results ---")
for query in test_queries:
    print(f"\nQuery: {query}")
    answer, source_docs = rag_system.answer_question(query)
    print(f"Answer: {answer}")
    print(f"Source: {source_docs[0].page_content[:150]}...")

--- Test Results ---

Query: What is the eligibility for remote work?
Answer: 
        [Generated Answer based on Context]
        Based on the policy:
        - Context Found: 2. Eligibility
Full-time employees who have completed their probationary period are eligible to appl...
        
        (This is a mock response. To use a real LLM, uncomment the OpenAI section below.)
        
Source: 2. Eligibility
Full-time employees who have completed their probationary period are eligible to apply for remote work. Roles that require physical pre...

Query: Does the company pay for internet?
Answer: 
        [Generated Answer based on Context]
        Based on the policy:
        - Context Found: 6. Expense Reimbursement
Acme Corp will reimburse up to $50/month for internet expenses. Home office...
        
        (This is a mock response. To use a real LLM, uncomment the OpenAI section below.)
        
Source: 6. Expense Reimbursement
Acme Corp will reimburse up to $50/month for internet 