## Task: Build a Campus FAQ Chatbot using RAG

### Objective:
Learn how Retrieval-Augmented Generation (RAG) works by building a small chatbot that answers questions about your college using vector embeddings and a mini vector database.

#### Step 0: Setup

1. Install required packages:

In [None]:
pip install streamlit sentence-transformers faiss-cpu numpy

#### Step 1: Prepare the Data

Task: Create a small FAQ dataset with at least 5 Q&A pairs.
Example:

Q: When does the library open?
A: The library opens at 8 AM and closes at 8 PM.

In [30]:
faq_text = """
Q: Where is Lambton College Ottawa located?
Q: ¿Dónde está ubicado Lambton College Ottawa?
A: Lambton College Ottawa is located at 223 Main Street, Ottawa, ON K1S 1C4, on the Saint Paul University campus in the heart of Canada's capital.

Q: How much does student housing cost in Ottawa?
Q: ¿Cuánto cuesta el alojamiento estudiantil en Ottawa?
A: On-campus residence typically costs between $800-$1200 per month including utilities. Off-campus shared apartments range from $600-$900 per month per room.

Q: How does public transportation work in Ottawa for students?
Q: ¿Cómo funciona el transporte público en Ottawa para estudiantes?
A: Ottawa uses OC Transpo buses and O-Train light rail. Students can get a U-Pass for approximately $229 per term. You'll need a Presto card which costs $4.

Q: Where are the cheapest grocery stores for students in Ottawa?
Q: ¿Dónde están los supermercados más baratos para estudiantes en Ottawa?
A: The most affordable grocery stores are No Frills, Food Basics (10% student discount on select days), Walmart, and FreshCo. Avoid Metro and Loblaws as they're more expensive.

Q: Can I work while studying at Lambton College Ottawa?
Q: ¿Puedo trabajar mientras estudio en Lambton College Ottawa?
A: Yes! International students can work off-campus up to 24 hours per week during academic sessions. You can work full-time during scheduled breaks.

Q: What is UHIP and do I need it as an international student?
Q: ¿Qué es UHIP y lo necesito como estudiante internacional?
A: UHIP is mandatory health insurance for international students in Ontario. It covers doctor visits, emergency care, and hospitalization. Your college automatically enrolls you.
"""

# Verify the data
print(f"Total characters: {len(faq_text)}")

Total characters: 1682


Checkpoint:

Students should have a list of questions and answers ready.

#### Step 2: Split Text into Chunks

Task: Split your FAQ text into non-empty lines. Each line (a question or an answer) becomes one chunk.

In [6]:
lines = [line.strip() for line in faq_text.split("\n") if line.strip()]

# Verify the chunks
print(f"Total lines (chunks): {len(lines)}")
print(f"\nFirst 3 lines:")
for i in range(3):
    print(f"{i+1}. {lines[i]}")

Total lines (chunks): 18

First 3 lines:
1. Q: Where is Lambton College Ottawa located?
2. Q: ¿Dónde está ubicado Lambton College Ottawa?
3. A: Lambton College Ottawa is located at 223 Main Street, Ottawa, ON K1S 1C4, on the Saint Paul University campus in the heart of Canada's capital.


Checkpoint:

Ensure each question line and each answer line is a separate element in a Python list.
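The cell above stores every line as its own chunk, so a question and its answer end up in different list elements. If you would rather keep one record per Q&A pair, the lines can be grouped afterwards. The following is a minimal sketch, not part of the original notebook; `group_faq_lines` and `sample_lines` are hypothetical names, and the sketch assumes every question line starts with `Q:` and every answer line with `A:`.

```python
def group_faq_lines(lines):
    """Group consecutive 'Q:' lines with the 'A:' line that follows them."""
    records = []
    pending_questions = []  # questions seen since the last answer
    for line in lines:
        if line.startswith("Q:"):
            pending_questions.append(line[2:].strip())
        elif line.startswith("A:") and pending_questions:
            records.append({
                "questions": pending_questions,
                "answer": line[2:].strip(),
            })
            pending_questions = []
    return records


# Sample lines in the same format as the FAQ text (hypothetical variable name)
sample_lines = [
    "Q: When does the library open?",
    "A: The library opens at 8 AM and closes at 8 PM.",
    "Q: Where is Lambton College Ottawa located?",
    "Q: ¿Dónde está ubicado Lambton College Ottawa?",
    "A: Lambton College Ottawa is located at 223 Main Street, Ottawa, ON K1S 1C4, "
    "on the Saint Paul University campus in the heart of Canada's capital.",
]

records = group_faq_lines(sample_lines)
for record in records:
    print(len(record["questions"]), "question(s) ->", record["answer"][:40])
```

With records like these you could embed only the questions and return `record["answer"]` directly, instead of searching for the answer line after retrieval.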

#### Step 3: Create Embeddings

Task: Convert each line to a vector using SentenceTransformer.

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode(lines)

In [32]:
from sentence_transformers import SentenceTransformer

# Load the model
model = SentenceTransformer('all-MiniLM-L6-v2')

# Convert each line to a vector
print("\nCreate embeddings:")
embeddings = model.encode(lines)
print(f"Shape: {embeddings.shape}")
print(f"Each line is now a vector of {embeddings.shape[1]} numbers")


Create embeddings:
Shape: (18, 384)
Each line is now a vector of 384 numbers


#### Step 4: Build the FAISS Index

Task: Store all embeddings in a FAISS index, which serves as our mini vector database.

In [34]:
import faiss
import numpy as np

# Get the dimension of our vectors
dimension = embeddings.shape[1]
print(f"Vector dimension: {dimension}")

# Create FAISS index
index = faiss.IndexFlatL2(dimension)

# Add all embeddings to the index
index.add(np.array(embeddings))

print(f"Total vectors in index: {index.ntotal}")

Vector dimension: 384
Total vectors in index: 18
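`IndexFlatL2` performs an exhaustive search: it compares the query against every stored vector and reports squared Euclidean (L2) distances. The pure-Python sketch below mimics that behaviour on toy 2-D vectors so you can see what the index computes. It is an illustration only, not how FAISS is implemented internally; `squared_l2`, `flat_l2_search`, and `toy_vectors` are hypothetical names.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(vectors, query, k):
    """Return (distances, indices) of the k stored vectors closest to query."""
    scored = sorted((squared_l2(v, query), i) for i, v in enumerate(vectors))
    top = scored[:k]
    return [d for d, _ in top], [i for _, i in top]


# Toy "index" with three 2-D vectors
toy_vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]]

distances, indices = flat_l2_search(toy_vectors, [0.9, 0.1], k=2)
print("Nearest indices:", indices)
print("Squared distances:", [round(d, 2) for d in distances])
```

As with `index.search`, the results come back sorted from closest to farthest, which is why the first hit is the best match.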


#### Step 5: Query the Database
Task: Take a user question, convert it to a vector, and find the most relevant FAQ line.

In [36]:
user_question = "Where can I buy cheap groceries?"

# Convert user question to vector
q_emb = model.encode([user_question])

# Search for the most similar vector
D, I = index.search(np.array(q_emb), k=1)

print(f"User question: {user_question}")
print(f"\nMost similar line found:")
print(f"Answer: {lines[I[0][0]]}")
print(f"\nDistance: {D[0][0]:.2f} (lower = more similar)")

User question: Where can I buy cheap groceries?

Most similar line found:
Answer: Q: Where are the cheapest grocery stores for students in Ottawa?

Distance: 0.67 (lower = more similar)
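The distance printed above is a squared L2 distance, which is hard to interpret on its own. If the embeddings have unit length (an assumption here; you can check it with `np.linalg.norm(embeddings, axis=1)`), the squared distance maps directly to cosine similarity through the identity ‖a − b‖² = 2 − 2·cos(a, b). The sketch below verifies the identity on two unit vectors and converts the notebook's numbers; `cosine_from_squared_l2` is a hypothetical helper name.

```python
import math


def cosine_from_squared_l2(squared_distance):
    """Cosine similarity implied by a squared L2 distance (unit vectors only)."""
    return 1.0 - squared_distance / 2.0


# Check the identity on two unit vectors that are 60 degrees apart
a = (1.0, 0.0)
b = (math.cos(math.pi / 3), math.sin(math.pi / 3))
squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
cosine = sum(x * y for x, y in zip(a, b))
print(f"identity holds: {math.isclose(cosine_from_squared_l2(squared_distance), cosine)}")

# Convert values from this notebook (valid only under the unit-length assumption)
grocery_cosine = cosine_from_squared_l2(0.67)    # distance of the grocery match
threshold_cosine = cosine_from_squared_l2(1.5)   # a distance of 1.5
print(f"0.67 -> cosine {grocery_cosine:.3f}")
print(f"1.50 -> cosine {threshold_cosine:.3f}")
```

Under that assumption, a distance of 0 means identical direction, 2 means unrelated (cosine 0), and 4 means opposite direction.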


In [24]:
# find_answer_improved: retrieval with a distance threshold

def find_answer_improved(user_question, k=5, threshold=1.5):
    """
    Find the FAQ answer closest to a question, using a distance threshold.
    If the closest match is farther than `threshold`, return a fallback message.
    
    Args:
        user_question: The question to search for
        k: Number of nearest lines to inspect
        threshold: Maximum squared L2 distance to accept as relevant (default 1.5)
    
    Returns:
        answer: The answer text, or a fallback message
        distance: The distance score of the match
    """
    # Convert user question to vector
    q_emb = model.encode([user_question])
    
    # Search for k most similar vectors
    D, I = index.search(np.array(q_emb), k)
    
    # Look for answer line
    for idx, dist in zip(I[0], D[0]):
        line = lines[idx]
        
        # Check if distance is too high (not relevant)
        if dist > threshold:
            return "I don't have information about that topic in my database. Please ask about: housing, transportation, groceries, work permits, or UHIP.", dist
        
        # If we found an answer line
        if line.startswith("A:"):
            return line, dist
        
        # If we found a question, look for answer nearby
        if line.startswith("Q:"):
            if idx + 1 < len(lines) and lines[idx + 1].startswith("A:"):
                return lines[idx + 1], dist
            if idx + 2 < len(lines) and lines[idx + 2].startswith("A:"):
                return lines[idx + 2], dist
    
    return "I couldn't find a relevant answer. Try rephrasing your question.", 999.0
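`find_answer_improved` locates the answer by checking `idx + 1` and `idx + 2`, which works here because each FAQ entry has at most two question lines before its answer. A more general option is to precompute, once, which answer line belongs to each line. The sketch below shows one way to do that; `build_answer_index` and `demo_lines` are hypothetical names, and the same `Q:` / `A:` prefix convention is assumed.

```python
def build_answer_index(lines):
    """Map every line index to the index of the answer line of its Q&A group."""
    answer_index = {}
    pending = []  # indices of 'Q:' lines still waiting for their answer
    for i, line in enumerate(lines):
        if line.startswith("Q:"):
            pending.append(i)
        elif line.startswith("A:"):
            for question_idx in pending:
                answer_index[question_idx] = i
            answer_index[i] = i  # an answer line maps to itself
            pending = []
    return answer_index


# Small stand-in for the `lines` list (hypothetical data)
demo_lines = [
    "Q: When does the library open?",
    "Q: What are the library hours?",
    "Q: Is the library open in the morning?",
    "A: The library opens at 8 AM and closes at 8 PM.",
]

answer_index = build_answer_index(demo_lines)
hit = 0  # pretend the search returned the first question line
print(demo_lines[answer_index[hit]])
```

Inside the search loop, `lines[answer_index[idx]]` would then replace both `if` branches, regardless of how many phrasings a question has.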

In [26]:
# Test 1: Question IN our FAQ
print("TEST 1: Question about groceries (IN FAQ)")
q1 = "Where can I buy cheap food?"
answer1, dist1 = find_answer_improved(q1)
print(f"Question: {q1}")
print(f"Answer: {answer1}")
print(f"Distance: {dist1:.2f}\n")

# Test 2: Question NOT in our FAQ
print("TEST 2: Question about wifi (NOT in FAQ)")
q2 = "What is the wifi password?"
answer2, dist2 = find_answer_improved(q2)
print(f"Question: {q2}")
print(f"Answer: {answer2}")
print(f"Distance: {dist2:.2f}\n")

# Test 3: Question NOT in our FAQ
print("TEST 3: Question about parking (NOT in FAQ)")
q3 = "Where can I park my car?"
answer3, dist3 = find_answer_improved(q3)
print(f"Question: {q3}")
print(f"Answer: {answer3}")
print(f"Distance: {dist3:.2f}")

TEST 1: Question about groceries (IN FAQ)
Question: Where can I buy cheap food?
Answer: A: The most affordable grocery stores are No Frills, Food Basics (10% student discount on select days), Walmart, and FreshCo. Avoid Metro and Loblaws as they're more expensive.
Distance: 0.86

TEST 2: Question about wifi (NOT in FAQ)
Question: What is the wifi password?
Answer: I don't have information about that topic in my database. Please ask about: housing, transportation, groceries, work permits, or UHIP.
Distance: 1.74

TEST 3: Question about parking (NOT in FAQ)
Question: Where can I park my car?
Answer: I don't have information about that topic in my database. Please ask about: housing, transportation, groceries, work permits, or UHIP.
Distance: 1.59
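The threshold of 1.5 was chosen by hand. One simple way to sanity-check such a value is to place it between the worst distance of a relevant question and the best distance of an irrelevant one. The sketch below applies that idea to the distances printed so far in this notebook. It is a rough heuristic on a handful of examples, not a rigorous calibration, and `midpoint_threshold` is a hypothetical helper name.

```python
def midpoint_threshold(relevant, irrelevant):
    """Midpoint between the largest relevant and smallest irrelevant distance."""
    worst_relevant = max(relevant)
    best_irrelevant = min(irrelevant)
    if worst_relevant >= best_irrelevant:
        raise ValueError("The groups overlap; no single threshold separates them.")
    return (worst_relevant + best_irrelevant) / 2


# Distances printed earlier in this notebook
in_faq_distances = [0.67, 0.86]      # grocery questions
out_of_faq_distances = [1.74, 1.59]  # wifi and parking questions

suggested = midpoint_threshold(in_faq_distances, out_of_faq_distances)
print(f"Suggested threshold: {suggested:.3f}")
print(f"1.5 separates the groups: {max(in_faq_distances) < 1.5 < min(out_of_faq_distances)}")
```

The gap between 0.86 and 1.59 is wide for these examples, but the parking question (1.59) sits close to 1.5, so testing more questions before settling on a value is worthwhile.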


In [28]:
# Test: Bilingual questions
q1 = "Where can I buy cheap groceries?"
answer1, dist1 = find_answer_improved(q1)
print(f"Question (English): {q1}")
print(f"Answer: {answer1}")
print(f"Distance: {dist1:.2f}\n")

q2 = "¿Dónde puedo comprar comida barata?"
answer2, dist2 = find_answer_improved(q2)
print(f"Question (Spanish): {q2}")
print(f"Answer: {answer2}")
print(f"Distance: {dist2:.2f}")

Question (English): Where can I buy cheap groceries?
Answer: A: The most affordable grocery stores are No Frills, Food Basics (10% student discount on select days), Walmart, and FreshCo. Avoid Metro and Loblaws as they're more expensive.
Distance: 0.67

Question (Spanish): ¿Dónde puedo comprar comida barata?
Answer: A: The most affordable grocery stores are No Frills, Food Basics (10% student discount on select days), Walmart, and FreshCo. Avoid Metro and Loblaws as they're more expensive.
Distance: 0.94


#### Step 6: Make it Interactive with Streamlit
Task: Use Streamlit to create a simple chatbot UI.


**The complete Streamlit application is in the file** `app.py`

Run the app from a terminal with: `streamlit run app.py`

In [None]:
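import numpy as np
import streamlit as st

# NOTE: `model`, `lines`, and `index` are not defined in this snippet.
# app.py must build them first, as in Steps 1-4.

st.title("Campus FAQ Chatbot")
user_question = st.text_input("Ask your question:")
if user_question:
    # Embed the question and retrieve the single closest FAQ line
    q_emb = model.encode([user_question])
    D, I = index.search(np.array(q_emb), k=1)
    # The closest line may be a "Q:" line; find_answer_improved handles that case
    st.write("Answer:", lines[I[0][0]])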

#### Step 7: Reflection

Questions for students:

**1. How does the chatbot "understand" the question?**

The chatbot converts text into numerical vectors (embeddings) that represent meaning. When you ask a question, it embeds the question the same way, retrieves the FAQ line whose vector is closest, and returns the answer attached to that line. Similar meanings produce similar vectors, so the bot finds relevant answers even when the wording differs.

**2. What happens if the user asks something not in the FAQ?**

Without a threshold, the bot returns the closest match even if it is irrelevant. We fixed this by setting a distance limit of 1.5 (FAISS `IndexFlatL2` reports squared L2 distances). If the closest line is farther away than that, the bot says it doesn't have that information and suggests available topics.

**3. How could you improve this system to handle more questions or longer documents?**

To improve the chatbot:
- Add more FAQ topics (scholarships, visa info, campus facilities).
- Split long documents into meaningful chunks instead of just lines.
- Use a language model (GPT/Claude) to generate natural responses.
- Fine-tune the embedding model on campus-specific terminology.
- Add topic categories to filter and improve search accuracy.
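For the chunking idea in the list above, one simple starting point is a sliding window of words with some overlap, so that a sentence cut at a chunk boundary still appears whole in a neighbouring chunk. The sketch below is a minimal illustration under that assumption; `chunk_words` and its default sizes are hypothetical, and splitting on sentences or headings would come closer to "meaningful" chunks.

```python
def chunk_words(text, chunk_size=50, overlap=10):
    """Split text into chunks of `chunk_size` words, sharing `overlap` words."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("Need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the window already reached the end of the text
    return chunks


# Tiny demo with 12 placeholder words
demo_text = " ".join(f"w{i}" for i in range(12))
demo_chunks = chunk_words(demo_text, chunk_size=5, overlap=2)
for chunk in demo_chunks:
    print(chunk)
```

Each chunk would then be embedded and added to the index exactly like the FAQ lines in Steps 3 and 4.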
