# Project 2: AI Handbook Assistant using RAG (Retrieval-Augmented Generation)

**Submitted by:** Nishant Roy  
**Course:** Gen AI & LLMs  

### 🎯 Objective:
To build a simple RAG pipeline that answers user questions from the content of the provided documents (e.g., Python Basics, Git Commands).  
The system retrieves the most relevant chunks from the text files and uses them as context, so that answers stay grounded in the documents.


In [2]:
# Importing necessary libraries
from sentence_transformers import SentenceTransformer, util
import numpy as np
import os

print("✅ Libraries imported successfully")


✅ Libraries imported successfully


In [3]:
# Load and Chunk Documents

# Folder where your files are uploaded (in Colab it's usually /content)
base_path = "/content"

# List of handbook files
files = ["python_basics.txt", "git_commands.txt", "general_notes.txt"]

# Read and store all file contents
documents = {}

for file in files:
    with open(os.path.join(base_path, file), 'r', encoding='utf-8') as f:
        content = f.read()
        documents[file] = content

# Show loaded file names
print("📂 Loaded Documents:")
for name in documents:
    print("-", name)


📂 Loaded Documents:
- python_basics.txt
- git_commands.txt
- general_notes.txt
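
The loading cell above assumes every listed file exists under `/content`, so it raises `FileNotFoundError` outside Colab or when an upload is missing. A minimal sketch of a more forgiving loader follows. The helper name `load_documents` and the demo file contents are hypothetical and not part of the original notebook.

```python
import tempfile
from pathlib import Path

def load_documents(base_path, file_names):
    """Hypothetical loader: return {file_name: text}, skipping missing files."""
    documents = {}
    for name in file_names:
        path = Path(base_path) / name
        if not path.is_file():
            # Report the gap instead of aborting the whole load
            print(f"Skipping missing file: {path}")
            continue
        documents[name] = path.read_text(encoding="utf-8")
    return documents

# Demo in a temporary folder with one real file and one missing file
with tempfile.TemporaryDirectory() as tmp:
    sample_text = "Functions are defined using the def keyword."
    (Path(tmp) / "python_basics.txt").write_text(sample_text, encoding="utf-8")
    demo = load_documents(tmp, ["python_basics.txt", "missing.txt"])

print(list(demo))
```

In the notebook, the same call could be made with `base_path` and `files` in place of the temporary folder.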


In [4]:
# Split documents into smaller chunks (for retrieval)

def chunk_text(text, chunk_size=3):
    """Splits text into chunks of given number of lines."""
    lines = text.split('\n')
    chunks = []
    for i in range(0, len(lines), chunk_size):
        chunk = " ".join(lines[i:i + chunk_size])
        chunks.append(chunk)
    return chunks

# Chunk all documents
chunks = []
for file_name, text in documents.items():
    for chunk in chunk_text(text):
        chunks.append((file_name, chunk))

print(f"✅ Total Chunks Created: {len(chunks)}")
print("\nSample Chunk:\n", chunks[0])


✅ Total Chunks Created: 8

Sample Chunk:
 ('python_basics.txt', 'Python is a popular high-level programming language.   It uses indentation to define code blocks.   Variables in Python are dynamically typed.  ')
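
`chunk_text` produces non-overlapping windows of three lines, so a line that depends on the one before it can end up cut off from its context. One common alternative is a sliding window in which neighbouring chunks share a few lines. The sketch below assumes a hypothetical `overlap` parameter that is not in the original notebook, and it also drops blank lines before chunking.

```python
def chunk_text_overlapping(text, chunk_size=3, overlap=1):
    """Hypothetical variant of chunk_text: sliding window over non-empty lines."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    # Strip whitespace and discard blank lines so no empty chunks are produced
    lines = [line.strip() for line in text.split("\n") if line.strip()]

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(lines), step):
        chunks.append(" ".join(lines[start:start + chunk_size]))
        # Stop once the window has reached the last line
        if start + chunk_size >= len(lines):
            break
    return chunks

print(chunk_text_overlapping("a\nb\nc\nd\ne"))
```

With `overlap=0` this behaves like the original chunker on text without blank lines. A larger overlap yields more chunks and therefore more embeddings to compute.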


In [5]:
# Create vector embeddings for all chunks

# Load a pre-trained embedding model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Extract all text chunks
texts = [chunk[1] for chunk in chunks]

# Generate embeddings
embeddings = model.encode(texts, convert_to_tensor=True)

print(f"✅ Embeddings created successfully for {len(texts)} chunks.")


✅ Embeddings created successfully for 8 chunks.


In [6]:
# Retrieve relevant chunks based on a user query

def retrieve_relevant_chunks(query, top_k=2):
    """Retrieve top_k most relevant chunks for a given query."""
    query_embedding = model.encode(query, convert_to_tensor=True)
    similarity_scores = util.cos_sim(query_embedding, embeddings)[0]

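    # Get top k chunk indices (move the scores to CPU so NumPy can sort them,
    # which also works when the embeddings live on a GPU)
    top_results = np.argsort(-similarity_scores.cpu().numpy())[:top_k]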

    # Collect matching chunks
    retrieved_chunks = [chunks[i] for i in top_results]
    return retrieved_chunks

# Example query test
query = "How do I create a Python function?"
results = retrieve_relevant_chunks(query)

print("🔎 Retrieved Relevant Chunks:\n")
for file_name, chunk in results:
    print(f"📘 From: {file_name}\n{chunk}\n")


🔎 Retrieved Relevant Chunks:

📘 From: python_basics.txt
Functions are defined using the def keyword.   Lists, tuples, and dictionaries are common data structures.   Modules can be imported using the import statement.  

📘 From: python_basics.txt
Python is a popular high-level programming language.   It uses indentation to define code blocks.   Variables in Python are dynamically typed.  
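
The retrieval step ranks chunks by cosine similarity between the query embedding and each chunk embedding, then keeps the `top_k` highest scores. To make that ranking concrete, here is a dependency-free sketch on toy 2-D vectors. The helper names and the vectors are made up for illustration, and the notebook itself relies on `util.cos_sim` rather than this code.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)

def top_k_indices(query_vec, doc_vecs, k=2):
    """Indices of the k vectors most similar to query_vec, best first."""
    scores = [cosine_similarity(query_vec, v) for v in doc_vecs]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Toy "chunk embeddings" and a toy "query embedding"
toy_docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.1]

print(top_k_indices(toy_query, toy_docs))  # [0, 2]
```

Because cosine similarity ignores vector length, the query `[2.0, 0.1]` still matches `[1.0, 0.0]` best: only the direction matters.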



In [7]:
# Generate a final answer using retrieved context

def generate_answer(query):
    """Simulated answer generation using retrieved context."""
    retrieved_chunks = retrieve_relevant_chunks(query, top_k=2)

    print("🔍 Retrieved context:")
    for file_name, chunk in retrieved_chunks:
        print(f"\n📘 From {file_name}:\n{chunk}\n")

    # Combine retrieved chunks as context
    context = " ".join([chunk for _, chunk in retrieved_chunks])

    # Simulate an AI-generated answer (mock generation: the retrieved context
    # is echoed back as the answer; no language model is called)
    print("🤖 AI Handbook Assistant:\n")
    print(f"Based on the handbook data, here's what I found about '{query}':\n")
    print(context)
    print("\n✅ This answer is grounded in the uploaded documents.\n")

# Example test query
generate_answer("What is Git and how do I use git commit?")


🔍 Retrieved context:

📘 From git_commands.txt:
Git is a version control system.   Use git init to initialize a new repository.   git add stages changes, and git commit saves them.  


📘 From git_commands.txt:
git status shows modified files.   git push uploads commits to a remote repository.   git pull fetches and merges updates from the remote.  

🤖 AI Handbook Assistant:

Based on the handbook data, here's what I found about 'What is Git and how do I use git commit?':

Git is a version control system.   Use git init to initialize a new repository.   git add stages changes, and git commit saves them.   git status shows modified files.   git push uploads commits to a remote repository.   git pull fetches and merges updates from the remote.  

✅ This answer is grounded in the uploaded documents.
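
Since `generate_answer` only echoes the retrieved context, the generation step is a mock. In a full RAG pipeline the retrieved chunks would usually be formatted into a prompt and passed to a language model. The sketch below shows only that prompt-assembly step. `build_prompt` and its wording are hypothetical, and no model call is made.

```python
def build_prompt(query, retrieved_chunks):
    """Hypothetical helper: format (file_name, chunk) pairs into an LLM prompt."""
    context_lines = []
    for i, (file_name, chunk) in enumerate(retrieved_chunks, start=1):
        # Number each chunk and keep its source so an answer can cite it
        context_lines.append(f"[{i}] ({file_name}) {chunk.strip()}")
    context = "\n".join(context_lines)

    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

example_chunks = [
    ("git_commands.txt", "Git is a version control system.  "),
    ("git_commands.txt", "git status shows modified files.  "),
]
print(build_prompt("What is Git?", example_chunks))
```

The returned string could then be sent to whichever text-generation model is available. The instruction to admit insufficient context is one way to handle queries the handbook does not cover.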



In [8]:
# Test the RAG Assistant with Multiple Queries

queries = [
    "What is Git used for?",
    "How do I define a function in Python?",
    "What is version control?",
    "What should I learn to become better at coding?"
]

for q in queries:
    print("=" * 80)
    print(f"🧑‍💻 User Query: {q}\n")
    generate_answer(q)


🧑‍💻 User Query: What is Git used for?

🔍 Retrieved context:

📘 From git_commands.txt:
Git is a version control system.   Use git init to initialize a new repository.   git add stages changes, and git commit saves them.  


📘 From git_commands.txt:
git status shows modified files.   git push uploads commits to a remote repository.   git pull fetches and merges updates from the remote.  

🤖 AI Handbook Assistant:

Based on the handbook data, here's what I found about 'What is Git used for?':

Git is a version control system.   Use git init to initialize a new repository.   git add stages changes, and git commit saves them.   git status shows modified files.   git push uploads commits to a remote repository.   git pull fetches and merges updates from the remote.  

✅ This answer is grounded in the uploaded documents.

🧑‍💻 User Query: How do I define a function in Python?

🔍 Retrieved context:

📘 From python_basics.txt:
Functions are defined using the def keyword.   Lists, tuples, and dict