<a href="https://colab.research.google.com/github/meronoumer/learning-agentic-ai-/blob/main/CampusBuzz_RAG_Pipeline.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Coding Exercise: Build a Basic RAG System

## CampusBuzz: A Campus Event Guide Agent

In this exercise, you will build a complete RAG pipeline from scratch. You will take a PDF document, break it into searchable chunks, store those chunks in a vector database, and connect everything to an LLM that can answer questions based on the document's contents.

The document you will work with is the Campus Event Guide for Spring 2026. It contains 9 chapters covering career events, clubs, arts and culture, health and wellness, academic enrichment, social events, key dates, and contact information. By the end of this notebook, you will have a working system that can answer questions like:

- "When is the Spring Career Fair?"
- "What outdoor trips does the Rec Center offer?"
- "How do I start a new club?"

The notebook is organized into six parts. Each part builds on the previous one. Most of the code you need has been covered in the lesson code previews. The final part asks you to do something new on your own.

**Instructions:** Run each code cell and complete every TODO.

---
## Setup

Run the cells below to install packages and configure your API key. If you have not set up your Google API key in Colab secrets yet, follow the instructions on the "Step-by-step Setup Guide" page on Canvas.

In [None]:
# Install required packages (this may take a minute; you may ignore any dependency-resolver errors)
!pip install -qU langchain-community langchain-text-splitters pypdf
!pip install -qU langchain-google-genai
!pip install -qU chromadb


In [None]:
# Configure your API key (if you get an error, revisit the "Step-by-step Setup Guide" page on Canvas)
import os
from google.colab import userdata
os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")
print("API key configured successfully!")

API key configured successfully!


In [None]:
# Upload the Campus Event Guide PDF
# TODO: Download the Campus_Event_Guide_Spring_2026.pdf file from Canvas and
#       put it in the Colab file system (click on the folder icon on the left
#       and upload the document). Do not put the document into a folder.

# Check to make sure file has been uploaded correctly
# Only proceed if the output of this cell is True
import os.path
os.path.isfile('Campus_Event_Guide_Spring_2026.pdf')

True

---
## Part 1: Document Loading

The first step in any RAG pipeline is loading your document. You will use `PyPDFLoader` to read the Campus Event Guide PDF and extract its text content.

**Your task:** Load the PDF and verify it loaded correctly by printing the number of pages and the content of one page.
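
Beyond printing one page, it can help to scan every page for empty text, since image-only pages can come back with no extractable content. The sketch below is a hypothetical helper (`find_blank_pages` is not part of LangChain) and uses stand-in objects that only mimic the `.page_content` attribute of loaded pages, so it runs without the PDF.

```python
from types import SimpleNamespace


def find_blank_pages(pages):
    """Return the zero-based indices of pages whose extracted text is empty."""
    return [i for i, page in enumerate(pages) if not page.page_content.strip()]


# Stand-in objects (assumption): they only mimic the .page_content attribute.
demo_pages = [
    SimpleNamespace(page_content="Riverdale University\nCampus Event Guide"),
    SimpleNamespace(page_content="   \n"),
    SimpleNamespace(page_content="2. Career and Professional Development"),
]

print(find_blank_pages(demo_pages))  # [1]
```

In the notebook you could pass the loaded `document` list to the same helper; an empty result means every page produced some text.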

In [None]:
from langchain_community.document_loaders import PyPDFLoader

# TODO: Create a PyPDFLoader for the file "Campus_Event_Guide_Spring_2026.pdf"
loader = PyPDFLoader("Campus_Event_Guide_Spring_2026.pdf")

# TODO: Load the document (returns one Document per PDF page)
document = loader.load()

# Print the number of pages to verify the load worked
print(f"Loaded {len(document)} pages")

Loaded 15 pages


In [None]:
# Pick a page and print its content to verify the text looks right.
# Try a few different page numbers to see what content is on each page.
page_number = 2  # Zero-based index into the page list; change this to explore different pages
print(f"--- Page {page_number} ---")
print(document[page_number].page_content)

--- Page 2 ---
2. Career and Professional Development 
The Career Center hosts events throughout the semester to help you build professional 
skills, explore career paths, and connect with employers. Attendance at Career Center 
events is tracked and can be added to your co-curricular transcript. 
Spring Career Fair 
The Spring Career Fair is the largest recruiting event of the semester. It takes place on 
Wednesday, February 19 from 10am to 3pm in Morrison Auditorium. Over 85 employers 
from technology, finance, healthcare, government, and nonprofit sectors will be present. 
Business professional attire is required. Bring at least 20 printed copies of your resume. 
Pre-registration is required through the Events Portal and opens February 1. Last year, 340 
students attended and 47 received interview invitations within two weeks of the fair. 
Resume Workshop Series* 
Three sessions offered throughout the semester. Session 1 covers resume basics and 
formatting on January 28 from 4-5:30

---
## Part 2: Chunking

Now you need to split the document into smaller chunks that can be searched individually. You will use `RecursiveCharacterTextSplitter`, which splits at natural boundaries like paragraphs and sentences.

**Your task:**
1. Combine all pages into a single text string
2. Create a text splitter and chunk the text
3. Experiment with different chunk sizes
4. Explain your final parameter choices
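
To build intuition for what "splitting at natural boundaries" means, here is a simplified pure-Python sketch of the recursive idea. It is not LangChain's implementation: `recursive_split` is a hypothetical helper that ignores `chunk_overlap`, drops the separators it splits on, and does not merge small pieces back together the way the real splitter does.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", ". ", " ")):
    """Split text so that every piece is at most chunk_size characters.

    Tries the coarsest separator first and only falls back to finer ones
    for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    first, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(first):
        pieces.extend(recursive_split(part, chunk_size, rest))
    return pieces


sample = "Spring Career Fair\n\nThe fair is on February 19. Bring copies of your resume."
for piece in recursive_split(sample, chunk_size=40):
    print(repr(piece))
```

The heading survives as its own piece because the paragraph break is tried first; only the long second paragraph is split further, at the sentence boundary.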

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

# First, combine all pages into one text string
full_text = "\n\n".join([page.page_content for page in document])
print(f"Total text length: {len(full_text)} characters")

Total text length: 23073 characters


In [None]:
# TODO: Create a RecursiveCharacterTextSplitter
# Start with chunk_size=500, chunk_overlap=0, and separators=["\n\n", "\n", ". ", " ", ""]
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=0,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# TODO: Split the full_text into chunks
chunks = text_splitter.split_text(full_text)

# Print the number of chunks and preview a few
print(f"Number of chunks: {len(chunks)}")
print()
for i, chunk in enumerate(chunks[:3]):
    print(f"Chunk {i} has {len(chunk)} characters:")
    print(chunk)
    print()

Number of chunks: 54

Chunk 0 has 384 characters:
Riverdale University 
Campus Event Guide 
Spring 2026 Edition 
Your complete guide to events, activities, and opportunities on campus this semester. 
This is a sample document for use within Break Through Tech's Agentic AI Specialization. It serves as the knowledge 
base for the “Building Your First RAG Pipeline” coding exercise, where you load, chunk, and query this document using

Chunk 1 has 285 characters:
a retrieval-augmented generation pipeline. It is also used for the “Fix This RAG System” coding exercise, where you 
identify issues in a RAG pipeline. It is intended for educational purposes only and does not contain advice that you 
should use for planning events at your university.

Chunk 2 has 450 characters:
1. Welcome and How to Use This Guide 
Welcome to the Spring 2026 semester at Riverdale University. This guide is your central 
resource for everything happening on campus over the next four months. Whether you are 
lookin

In [None]:
# EXPERIMENT: Try different chunk_size values (500, 1000, 1500)
# and different chunk_overlap values (0, 50, 100).
# Run this cell multiple times with different values.

test_chunk_size = 2000    # Try changing this: 500, 1000, 1500
test_chunk_overlap = 200  # Try changing this: 0, 50, 100

test_splitter = RecursiveCharacterTextSplitter(
    chunk_size=test_chunk_size,
    chunk_overlap=test_chunk_overlap,
    separators=["\n\n", "\n", ". ", " ", ""]
)
test_chunks = test_splitter.split_text(full_text)

# Print the values actually in use rather than a hard-coded label
print(f"chunk_size={test_chunk_size}, chunk_overlap={test_chunk_overlap} -> {len(test_chunks)} chunks")
print("\nSample chunk (chunk 5):")
print(test_chunks[5] if len(test_chunks) > 5 else test_chunks[-1])

chunk_size=2000, chunk_overlap=200 -> 16 chunks

Sample chunk (chunk 5):
Club Spotlight: Debate Society 
Meets Mondays and Thursdays from 6-7:30pm in SUB Room 302. The team competes in 
regional and national tournaments throughout the semester. Practice sessions on 
Mondays are open to anyone interested. Thursday sessions are for competitive team 
members preparing for upcoming tournaments. The spring tournament schedule includes 
the Regional Qualifier on February 22, the State Championship on March 15, and Nationals 
on April 11-13 in Chicago. No prior debate experience needed to attend Monday practices. 
Starting a New Club 
Students interested in starting a new organization must submit a registration form to the 
Office of Student Life by February 7. Requirements include at least 10 interested student 
members, a faculty or staff advisor, a proposed constitution, and a brief description of the 
organization's purpose. The Student Government Association reviews applications on a 
rol

**Your reflection:** After experimenting, choose the chunk_size and chunk_overlap you want to use for the rest of this notebook. Write 2-3 sentences explaining your choice. What did you notice about how different parameters affected the chunks?

*Double-click this cell to edit. Write your answer below.*

I found that a chunk overlap of about 10 percent of the chunk size works best for preserving meaning across chunk boundaries. I am therefore using a chunk size of 300 with a chunk overlap of 30: the guide is not a dense textbook but a 15-page document, and smaller chunks suit detailed, specific queries.
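
To see what an overlap does mechanically, here is a minimal fixed-window sketch. It is an illustration only: `sliding_chunks` is a hypothetical helper, and the real `RecursiveCharacterTextSplitter` applies overlap while merging separator-delimited pieces, so its overlaps may not be exactly `chunk_overlap` characters.

```python
def sliding_chunks(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows; neighbours share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    # Stop once the remaining text would already be covered by the previous window.
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]


demo_chunks = sliding_chunks("abcdefghijklmnopqrstuvwxyz", chunk_size=10, chunk_overlap=2)
print(demo_chunks)  # ['abcdefghij', 'ijklmnopqr', 'qrstuvwxyz']
```

Each window repeats the last two characters of the previous one, which is how an overlap keeps a sentence that straddles a boundary visible in both chunks.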

In [None]:
# Now create your final text splitter with your chosen parameters
# and generate the chunks you will use for the rest of the notebook.

final_splitter = RecursiveCharacterTextSplitter(
    chunk_size=300,      # Replace with your chosen value
    chunk_overlap=30,   # Replace with your chosen value
    separators=["\n\n", "\n", ". ", " ", ""]
)
chunks = final_splitter.split_text(full_text)
print(f"Final chunking: {len(chunks)} chunks")

Final chunking: 91 chunks


---
## Part 3: Embedding and Storage

Now you will convert your chunks into embeddings and store them in a vector database. This is what makes your chunks searchable by meaning rather than just keywords.

**Your task:** Create an embedding model and store your chunks in a Chroma vector database.
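
"Searchable by meaning" comes down to comparing vectors. The sketch below uses made-up 3-dimensional vectors as stand-ins for real embeddings, which are far longer, and compares them with cosine similarity. The numbers are invented for illustration and do not come from any embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Invented toy "embeddings" (assumption): related phrases point in similar directions.
toy_vectors = {
    "career fair": [0.9, 0.1, 0.0],
    "job expo": [0.8, 0.2, 0.1],
    "yoga class": [0.0, 0.2, 0.9],
}

query_vec = toy_vectors["career fair"]
for label, vec in toy_vectors.items():
    print(f"{label!r}: {cosine_similarity(query_vec, vec):.3f}")
```

"job expo" shares no words with "career fair", yet its vector scores much closer to it than "yoga class" does, which is the behavior keyword search cannot provide.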

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_community.vectorstores import Chroma

# TODO: Create an embedding model using GoogleGenerativeAIEmbeddings
# Use the model "models/gemini-embedding-001"
embeddings = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001"
)

# TODO: Create a Chroma vector store from your chunks
# Use Chroma.from_texts() with your chunks and embedding model
vectorstore = Chroma.from_texts(
    texts=chunks,
    embedding=embeddings,
    collection_name="my_documents_collection"
)

print(f"Stored {vectorstore._collection.count()} chunks in the vector store")

Stored 91 chunks in the vector store


In [None]:
# Preview what is stored in the vector database
sample = vectorstore._collection.peek(limit=1)
print(f"Sample chunk ID: {sample['ids']}")
print(f"Sample text: {sample['documents']}")
print(f"Embedding length: {len(sample['embeddings'][0])}")

Sample chunk ID: ['ee413a26-167e-425c-8c77-5c0ebc2a879e']
Sample text: ["Riverdale University \nCampus Event Guide \nSpring 2026 Edition \nYour complete guide to events, activities, and opportunities on campus this semester. \nThis is a sample document for use within Break Through Tech's Agentic AI Specialization. It serves as the knowledge"]
Embedding length: 3072


---
## Part 4: Retrieval

With your chunks stored and embedded, you can now search for relevant information. You will run queries against your vector store and examine what comes back.

**Your task:**
1. Run a similarity search and examine the results
2. Run a search with similarity scores
3. Experiment with different values of k
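
Conceptually, a top-k search scores every stored vector against the query, sorts by distance, and keeps the k nearest. The sketch below shows that idea with invented toy vectors. Cosine distance is an assumption made for illustration; the metric a real vector store uses depends on how its collection is configured, and real stores use indexes rather than a full scan.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, so 0.0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def top_k(query_vec, store, k):
    """Return the k (text, distance) pairs closest to query_vec, nearest first."""
    scored = [(text, cosine_distance(query_vec, vec)) for text, vec in store]
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Invented toy store (assumption): text paired with a made-up vector.
toy_store = [
    ("Spring Career Fair details", [0.9, 0.1, 0.0]),
    ("Key dates calendar", [0.6, 0.4, 0.2]),
    ("Group fitness schedule", [0.0, 0.2, 0.9]),
]

for text, dist in top_k([1.0, 0.0, 0.0], toy_store, k=2):
    print(f"{dist:.3f}  {text}")
```

Raising k simply keeps more of the sorted list, so the extra results are always the ones that are further from the query.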

In [None]:
# TODO: Search for chunks similar to this query and use a k value of 5
query = "When is the Spring Career Fair?"
results = vectorstore.similarity_search_with_score(query, k=5)

# Print the results (the score is a distance, so a lower score means a closer match)
for i, (doc, score) in enumerate(results):
    print(f"Result {i+1}:")
    print(doc.page_content)
    print(f"Score: {score}")
    print("---")


Result 1:
Spring Career Fair 
The Spring Career Fair is the largest recruiting event of the semester. It takes place on 
Wednesday, February 19 from 10am to 3pm in Morrison Auditorium. Over 85 employers 
from technology, finance, healthcare, government, and nonprofit sectors will be present.
Score: 0.4541863203048706
---
Result 2:
Feb 11 Resume Workshop Session 2 (SUB 204, 4pm) 
Feb 12 Study Abroad Info Session #2 (SUB 302, 4pm) 
Feb 15 Mock Interview registration opens; Pine Ridge hike 
Feb 19 Spring Career Fair (Morrison Aud, 10am-3pm) 
Feb 22 Debate Society Regional Qualifier
Score: 0.5373605489730835
---
Result 3:
Business professional attire is required. Bring at least 20 printed copies of your resume. 
Pre-registration is required through the Events Portal and opens February 1. Last year, 340 
students attended and 47 received interview invitations within two weeks of the fair. 
Resume Workshop Series*
Score: 0.5427034497261047
---
Result 4:
Mock Interview Days 
Held on March 5 a

In [None]:
# TODO: Now search with similarity scores using vectorstore.similarity_search_with_score() with k=5
query = "What outdoor trips are available this semester?"
results_with_scores = vectorstore.similarity_search_with_score(query, k=5)

# Print results with their scores
for doc, score in results_with_scores:
    print(f"Score: {score:.3f}")
    print(doc.page_content)
    print("---")

# EXPERIMENT: Try different values of k (2, 3, 10) with the same query.
# Consider how the results change, and whether increasing k helps or introduces noise.

Score: 0.437
Outdoor Adventure Trips 
The Recreation Center organizes four weekend outdoor trips each semester. Spring trips 
include: a day hike at Pine Ridge State Park on February 15 (beginner-friendly, 6 miles 
round trip), a kayaking trip on Lake Marion on March 22 (no experience necessary,
---
Score: 0.439
instruction provided), a rock climbing day at Stone Valley on April 5 (all skill levels), and a 
camping weekend at Blue Mountain on April 19-20 (two days, one night). Transportation, 
equipment, and meals are provided. Cost is $25 per day trip and $45 for the overnight trip.
---
Score: 0.506
licensed counselors from the Counseling Center. If you are experiencing a mental health 
crisis, contact the Counseling Center directly at 555-0148 or visit during walk-in hours 
(Monday through Friday, 9am to 12pm). 
Outdoor Adventure Trips
---
Score: 0.527
Apr 5 Rock climbing day trip 
Apr 11-13 Debate Nationals (Chicago) 
Apr 19-20 Blue Mountain camping trip 
Apr 25 Undergraduate Resear

---
## Part 5: Generation

The final step is connecting your retriever to an LLM so it can generate answers based on the retrieved chunks. This completes the RAG pipeline.

**Your task:**
1. Create a prompt template
2. Build the RAG chain
3. Test with provided queries
4. Write and test your own queries
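
Before wiring up the chain, it may help to see the same flow as plain functions: retrieve chunks, join them into a context string, fill the prompt, then call the model. This is a sketch with stand-ins, not the LangChain chain itself: `format_chunks`, `build_prompt`, `answer`, `fake_retrieve`, and `fake_generate` are hypothetical names, and no LLM is called.

```python
def format_chunks(chunks):
    """Join retrieved chunk texts into one context string."""
    return "\n\n---\n\n".join(chunks)


def build_prompt(context, question):
    """Fill a minimal prompt with the context and the user's question."""
    return (
        "Answer using ONLY the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\nAnswer: "
    )


def answer(question, retrieve, generate):
    """Retrieve -> format -> fill the prompt -> generate, in that order."""
    context = format_chunks(retrieve(question))
    return generate(build_prompt(context, question))


# Stand-ins (assumption) so the sketch runs without a vector store or API key.
def fake_retrieve(question):
    return ["Feb 19 Spring Career Fair (Morrison Aud, 10am-3pm)"]


def fake_generate(prompt):
    return f"[model received a prompt of {len(prompt)} characters]"


print(answer("When is the Spring Career Fair?", fake_retrieve, fake_generate))
```

The question is used twice, once to retrieve and once inside the prompt, which is why the chain needs a pass-through for it alongside the retriever.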

In [None]:
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Create the LLM
llm = ChatGoogleGenerativeAI(model="gemini-flash-latest")

# TODO: Create a prompt template for CampusBuzz.
# CampusBuzz is not just a Q&A system. It is a friendly, knowledgeable
# campus advisor that helps students get the most out of their semester.
# Your prompt should do three things:
# 1. Give CampusBuzz a persona. How does a great campus advisor talk?
#    Think about tone, enthusiasm, and how they make students feel welcome.
# 2. Constrain the LLM to use only the provided context when answering.
#    The {context} variable contains the retrieved chunks and the
#    {question} variable contains the user's question.
# 3. Tell the LLM what to do when the context does not contain enough
#    information to answer. A good campus advisor does not make things up,
#    but they also do not just say "I don't know." What would they do instead?

template = """You are CampusBuzz, a friendly, enthusiastic campus advisor who helps students get the most out of their semester.

You must answer using ONLY the information provided in the context.
If the answer is not clearly supported by the context, say:
"I don't have enough information in the Campus Event Guide to answer that."
Do not use outside knowledge.

Context:
{context}

Question: {question}

Answer: """

prompt = ChatPromptTemplate.from_template(template)

In [None]:
# Helper function to format retrieved documents
def format_docs(docs):
    return "\n\n---\n\n".join(doc.page_content for doc in docs)

# TODO: Create the RAG chain
# Connect the retriever, prompt, LLM, and output parser
# Use vectorstore.as_retriever(search_kwargs={"k": 3}) for the retriever

rag_chain = (
    {
        "context": vectorstore.as_retriever(search_kwargs={"k": 3}) | format_docs,
        "question": RunnablePassthrough(),
    }
    | prompt
    | llm
    | StrOutputParser()
)

In [None]:
# Test with the provided queries
test_queries = [
    "When is the Spring Career Fair and what should I bring?",
    "What fitness classes are offered at the Recreation Center?",
    "How do I submit work to the Student Art Exhibition?",
    "When is the deadline to register for intramural sports?",
    "What is the process for starting a new student club?"
]

for query in test_queries:
    print(f"Q: {query}")
    response = rag_chain.invoke(query)
    print(f"A: {response}")
    print("=" * 60)

Q: When is the Spring Career Fair and what should I bring?
A: Hello! I am so excited to help you prepare for a fantastic semester! 

The Spring Career Fair will take place on Wednesday, February 19, from 10am to 3pm in Morrison Auditorium. To make a great impression and set yourself up for success, you should bring at least 20 printed copies of your resume! Also, please remember that business professional attire is required for this event. 

I can't wait to see you there making great connections!
Q: What fitness classes are offered at the Recreation Center?
A: Hello there! I am so thrilled to help you have the best semester ever by staying active and energized! 

The Recreation Center has a wonderful variety of fitness classes to help you feel your best! You can join us for:
*   **Yoga** (Monday, Wednesday, and Friday)
*   **HIIT (High Intensity Interval Training)** (Tuesday and Thursday)
*   **Spin classes** (Monday and Wednesday)
*   **Zumba** (Friday)
*   **Saturday morning bootcamp

In [None]:
# For each test query above, check what chunks were actually retrieved.
# Pick one query and inspect the retrieval results.
# This is the habit we discussed in the videos: always check retrieval.

check_query = "What fitness classes are offered at the Recreation Center?"  # TODO: place one of the test queries here
retriever = vectorstore.as_retriever(search_kwargs={"k": 3})
retrieved_docs = retriever.invoke(check_query)

print(f"Query: {check_query}")
print(f"\nRetrieved {len(retrieved_docs)} chunks:\n")
for i, doc in enumerate(retrieved_docs):
    print(f"Chunk {i+1}:")
    print(doc.page_content)
    print("---")

Query: What fitness classes are offered at the Recreation Center?

Retrieved 3 chunks:

Chunk 1:
Center events. 
Group Fitness Schedule 
Group fitness classes run Monday through Saturday in the Recreation Center studios. Yoga 
is offered Monday, Wednesday, and Friday from 7-8am in Studio A. HIIT (High Intensity
---
Chunk 2:
Interval Training) meets Tuesday and Thursday from 12-12:45pm in Studio B. Spin classes 
are Monday and Wednesday from 5:30-6:15pm in the Cycling Room. Zumba is Friday from 
5-6pm in Studio A. Saturday morning bootcamp runs from 9-10am in Studio B. All classes
---
Chunk 3:
5. Health and Wellness 
The Recreation Center and the Counseling Center partner to offer a range of programs 
focused on physical and mental well-being. All fitness classes and wellness workshops are 
free for currently enrolled students. You must bring your student ID to all Recreation
---


---
## Part 6: Your Own Queries and Reflection

Now it is your turn. Write at least **2 of your own queries** that test different aspects of the system. Try to include at least one question that covers a different chapter or topic than the test queries above.

After running your queries, also try one question that the document **cannot** answer. Does the system handle it gracefully?
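
One rough way to check an out-of-scope response is to look for the fallback wording from the prompt. The sketch below assumes the model repeats a phrase such as "don't have enough information"; an LLM may paraphrase it, so treat a `False` result as a cue to read the answer rather than as proof of a hallucination. `looks_like_refusal` and `FALLBACK_MARKERS` are hypothetical names.

```python
# Phrases taken as signs of a refusal (assumption: the model echoes the prompt's wording).
FALLBACK_MARKERS = (
    "don't have enough information",
    "do not have enough information",
)


def looks_like_refusal(response):
    """True if the response contains one of the fallback phrases."""
    text = response.lower().replace("\u2019", "'")  # normalize curly apostrophes
    return any(marker in text for marker in FALLBACK_MARKERS)


print(looks_like_refusal("I don't have enough information in the Campus Event Guide to answer that."))
print(looks_like_refusal("The Spring Career Fair is on Wednesday, February 19."))
```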

In [None]:
# TODO: Replace QUERY 1 with your own question
my_query_1 = "I am a computer science major. What activities and clubs can I join on campus?"
response_1 = rag_chain.invoke(my_query_1)
print(f"Q: {my_query_1}")
print(f"A: {response_1}")
print("=" * 60)

# TODO: Replace QUERY 2 with your own question
my_query_2 = "My friend and I are trying to become more athletic, and we would like to join fitness groups on campus beyond just the gym. What do you recommend?"
response_2 = rag_chain.invoke(my_query_2)
print(f"Q: {my_query_2}")
print(f"A: {response_2}")
print("=" * 60)

# TODO: Replace QUERY 3 with something the document cannot answer
out_of_scope = "I am wondering what disability inclusion looks like on campus. What support is there for students with disabilities, both cognitive and ambulatory?"
response_oos = rag_chain.invoke(out_of_scope)
print(f"Q: {out_of_scope}")
print(f"A: {response_oos}")

Q: I am a computer science major. What activities and clubs can I join on campus?
A: Hello there! I am so excited to help you make this your best semester yet! As a computer science major, you have some fantastic opportunities to dive into tech-related activities right here on campus.

Based on our campus guide, here is what you can check out:

*   **The Innovation Hub:** This is the place to be! It’s the home for **hackathons, tech talks, and maker workshops**, which are perfect for sharpening your skills and meeting fellow tech enthusiasts.
*   **Riverdale Robotics Club:** They meet every **Tuesday from 7-9pm in the Innovation Hub**. They are currently building an autonomous delivery robot! The best part is that they are open to all majors and skill levels, so you’ll fit right in.
*   **Open Practice Sessions:** If you're looking for something competitive, there are practice sessions on **Mondays** that are open to anyone interested in joining the regional and national tournament te

---
**To submit:** Download this notebook as a .ipynb file (File > Download > Download .ipynb) and upload it to Canvas.