<a href="https://colab.research.google.com/github/kumarsirish/rag-workshop/blob/main/college-department-rag.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RAG (Retrieval-Augmented Generation) System
## Fictional Undergrad Department - DISE

This notebook demonstrates building a basic RAG pipeline for question-answering about a fictional department (DISE).

## What We'll Build:

1. **Document Embedding** - Convert text documents into vector representations using `all-MiniLM-L6-v2`
2. **Vector Index** - Create a FAISS index for fast similarity search
3. **Retrieval** - Find relevant documents based on user queries using cosine similarity
4. **Generation** - Use Llama 3.1-8B (via HuggingFace API) to generate answers from retrieved context

## Key Technologies:
- **SentenceTransformers**: Open-source embedding model
- **FAISS**: Facebook's similarity search library
- **HuggingFace Inference API**: Free LLM access without local model download

## Simplifications:
- No document chunking (documents are already small)
- No authentication required (using free HF inference)
- Minimal preprocessing for educational clarity
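
For longer documents, chunking would normally come before embedding. The sketch below is not used anywhere in this notebook; it only illustrates one simple approach, splitting on whitespace into overlapping word windows. The function name `chunk_words` and the `chunk_size` and `overlap` values are assumptions chosen for illustration.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into overlapping chunks of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reaches the end of the text
    return chunks


# Synthetic 120-word document, for illustration only
long_document = " ".join(f"word{i}" for i in range(120))
chunks = chunk_words(long_document, chunk_size=50, overlap=10)
print(f"Number of chunks: {len(chunks)}")
```

Each chunk would then be embedded and indexed as its own document, so retrieval returns a focused passage instead of a whole file.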

## Setup Instructions (Optional)

### Getting a HuggingFace Token:

While this notebook uses HuggingFace's free inference API that works without authentication for some models, having a token provides better rate limits and access to more models.

**Steps to get your free HF token:**

1. Visit [https://huggingface.co/](https://huggingface.co/) and sign up (free)
2. Go to [https://huggingface.co/settings/tokens](https://huggingface.co/settings/tokens)
3. Click **"New token"** → Give it a name → Select **"Read"** access
4. Copy the generated token
5. Set it in your environment:
   ```bash
   export HF_TOKEN="your_token_here"
   ```
   Or in Python:
   ```python
   import os
   os.environ["HF_TOKEN"] = "your_token_here"
   ```
   Or, if running in Google Colab, add the token to the Colab secrets under the key `HF_TOKEN`.
   

**Note**: Add a token if you run into rate limits or need access to gated models.

In [None]:
!pip install faiss-cpu sentence-transformers
import faiss  # Facebook AI Similarity Search
from sentence_transformers import SentenceTransformer
from huggingface_hub import InferenceClient


# Load the open-source embedding model
EMBEDDING_MODEL_NAME = "all-MiniLM-L6-v2"
embedding_model = SentenceTransformer(EMBEDDING_MODEL_NAME)
print(f"Loaded embedding model: {EMBEDDING_MODEL_NAME}")


In [None]:
#Sample documents to index

fictious_department_info = [
"The Department of Intelligent Systems Engineering (DISE) is a small, focused department that works on applied AI and intelligent systems.",
"It currently has around 40 students, with a healthy mix of undergraduate and postgraduate learners.",
"The department is run by a team of 14 professors, including experienced faculty members and a few industry practitioners.",
"Students can choose from about 5 courses, ranging from core subjects to electives and hands-on project work.",
"Overall, DISE aims to prepare students for real-world engineering roles through practical learning and industry exposure."
]

In [None]:
#Generate embeddings for the documents using sentence-transformers
def generate_embeddings(documents):
    embeddings = embedding_model.encode(documents, convert_to_numpy=True)
    return embeddings.astype("float32") #float32 array is required for FAISS

In [None]:
embeddings = generate_embeddings(fictious_department_info)
dimension = embeddings.shape[1]
print(f"Generated Embeddings: {embeddings}, Shape: {embeddings.shape}")

In [None]:
# Create a FAISS index with the inner product metric.
# On unit-length vectors, inner product equals cosine similarity.
faiss.normalize_L2(embeddings)  # normalizes each row to unit length, in place
index = faiss.IndexFlatIP(dimension)
index.add(embeddings)
print(f"Number of documents indexed: {index.ntotal}")
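
The index above relies on the fact that the inner product of two unit-length vectors equals their cosine similarity. A minimal sketch of that identity follows, written in plain Python so it runs without FAISS. The helper names and the two example vectors are made up for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine similarity computed directly from the definition."""
    norm_a = math.sqrt(inner_product(a, a))
    norm_b = math.sqrt(inner_product(b, b))
    return inner_product(a, b) / (norm_a * norm_b)


# Two made-up 2-D vectors
a = [3.0, 4.0]
b = [4.0, 3.0]

ip_of_normalized = inner_product(l2_normalize(a), l2_normalize(b))
print(f"Inner product after normalization: {ip_of_normalized:.2f}")
print(f"Cosine similarity:                 {cosine_similarity(a, b):.2f}")
```

Both lines print the same value, which is why the document vectors and the query vector are each normalized before they reach `IndexFlatIP`.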

In [None]:
# Search the FAISS index for the top_k documents most similar to the query
def search_index(query, top_k=2):
    query_embedding = generate_embeddings([query])
    faiss.normalize_L2(query_embedding)  # Normalize the query to unit length (in place)
    # With IndexFlatIP on normalized vectors, the returned scores are cosine similarities
    scores, indices = index.search(query_embedding, top_k)

    results = []
    for score, doc_idx in zip(scores[0], indices[0]):
        if doc_idx == -1:  # FAISS pads with -1 when fewer than top_k documents exist
            continue
        results.append((fictious_department_info[doc_idx], float(score)))

    return results

In [None]:
user_query = "Whats the full form of DISE?"
#user_query = "How many students in DISE"
#user_query = "how many casual leaves can i get"

In [None]:
#retrieve documents based on the query
retrieved_docs = search_index(user_query, top_k=2)
print("Top retrieved documents:")
for doc, score in retrieved_docs:
    print(f"Document: {doc}, Score: {score}")
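
The search always returns `top_k` documents, even for a query the documents do not cover, such as the commented-out "how many casual leaves can i get". One possible safeguard is to drop results below a minimum similarity score. The sketch below assumes results are `(document, score)` pairs like those printed above. The function name `filter_by_score`, the threshold of 0.3, and the example scores are all hypothetical; a real threshold would need tuning for the embedding model in use.

```python
def filter_by_score(results, min_score=0.3):
    """Keep only (document, score) pairs whose score reaches min_score."""
    return [(doc, score) for doc, score in results if score >= min_score]


# Hypothetical retrieval results; the scores are made up for illustration
example_results = [
    ("It currently has around 40 students.", 0.62),
    ("Students can choose from about 5 courses.", 0.12),
]

kept = filter_by_score(example_results, min_score=0.3)
for doc, score in kept:
    print(f"Kept: {doc} (score={score})")
```

If the filtered list comes back empty, the pipeline could tell the user that no relevant document was found instead of passing weak context to the model.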


In [None]:
#Build prompt with retrieved documents
def build_prompt(query, retrieved_docs):
    # Render each retrieved document as a bullet; similarity scores are not shown to the model
    context = "\n".join(f"- {doc}" for doc, _ in retrieved_docs)
    prompt = f"Based on the following documents:\n{context}\nAnswer the question: {query}"
    return prompt
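
To make the prompt layout concrete, the sketch below applies the same formatting to two made-up documents and prints the result. `build_prompt_demo` is a hypothetical stand-alone copy of the logic above, included only so the snippet runs on its own. The documents and scores are invented.

```python
def build_prompt_demo(query, retrieved_docs):
    """Stand-alone copy of the prompt layout: bulleted context, then the question."""
    context = "\n".join(f"- {doc}" for doc, _ in retrieved_docs)
    return f"Based on the following documents:\n{context}\nAnswer the question: {query}"


# Invented (document, score) pairs, for illustration only
demo_docs = [
    ("The library opens at 9 am.", 0.71),
    ("The library closes at 6 pm.", 0.65),
]

demo_prompt = build_prompt_demo("When does the library open?", demo_docs)
print(demo_prompt)
# Prints:
# Based on the following documents:
# - The library opens at 9 am.
# - The library closes at 6 pm.
# Answer the question: When does the library open?
```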

In [None]:
# Build prompt
user_prompt = build_prompt(user_query, retrieved_docs)
rag_prompt = [
    {"role": "system", "content": "You are a helpful assistant. You must always answer the question asked"},
    # {"role": "system", "content": "You are a helpful assistant. Strictly use the provided documents to answer the user's question."},
    {"role": "user", "content": user_prompt},
]

simple_prompt = [
    {"role": "system", "content": "You are a helpful assistant. You must always answer"},
    {"role": "user", "content": user_query},
]

print(f"Constructed Prompt: {rag_prompt}")
print(f"Simple prompt: {simple_prompt}")


Reference: https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
<br><b>
The "8B" in the model name means it has about 8 billion parameters. That is relatively small next to the 70B variant or larger proprietary models, which makes it practical for consumer hardware and for tasks where speed and low resource usage matter, while still offering good capabilities.
</b>
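
As a rough back-of-the-envelope check on the "consumer hardware" point, the memory needed for the weights alone is approximately the parameter count times the bytes per parameter. The sketch below assumes round parameter counts of 8 and 70 billion and ignores activations, the KV cache, and runtime overhead, so real requirements are higher. This notebook calls a hosted API and downloads no weights; the numbers are context only, and `weight_memory_gib` is a hypothetical helper.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Approximate memory for model weights only, in GiB."""
    return num_params * bytes_per_param / 1024**3


fp16_8b = weight_memory_gib(8e9, 2)     # 16-bit weights: 2 bytes per parameter
int4_8b = weight_memory_gib(8e9, 0.5)   # 4-bit quantized weights: half a byte per parameter
fp16_70b = weight_memory_gib(70e9, 2)   # the 70B variant at 16 bits, for comparison

print(f"8B  @ 16-bit: {fp16_8b:.1f} GiB")
print(f"8B  @ 4-bit:  {int4_8b:.1f} GiB")
print(f"70B @ 16-bit: {fp16_70b:.1f} GiB")
```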

In [None]:
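def get_response(prompt):
    """Send a list of chat messages to the model and return the reply text.

    Relies on the global `client`, which is created in the next cell.
    """
    response = client.chat_completion(
        model="meta-llama/Llama-3.1-8B-Instruct",
        messages=prompt,
        max_tokens=150,   # cap on the length of the generated answer
        temperature=0.9,  # higher values give more varied output
    )
    return response.choices[0].message.content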

In [None]:
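import os

# Look for a HuggingFace token: first in the environment, then in Colab secrets
hf_token = os.environ.get("HF_TOKEN")
if not hf_token:
    try:
        from google.colab import userdata  # only available inside Google Colab
        hf_token = userdata.get("HF_TOKEN").strip()
    except Exception:
        hf_token = None  # no token found; fall back to unauthenticated access

# Without a token, requests may be rate-limited or rejected for gated models
client = InferenceClient(token=hf_token)

response = get_response(rag_prompt)
print(f"Generated Answer with RAG: {response}")

response = get_response(simple_prompt)
print(f"\nGenerated Answer without RAG: {response}")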