A Retrieval-Augmented Generation (RAG) chatbot that answers questions about Claremont McKenna College senior theses using real database records — not general model knowledge.
🌐 https://cmcthesischatbot.com
This system translates natural language questions into structured database queries and semantic searches over CMC thesis metadata.
It can:
- Search theses by title, topic, advisor, department, or year
- Rank advisors by topic expertise
- Filter by award, season, or publication date
- Summarize real abstracts
- Generate thesis ideas grounded in actual CMC data
Unlike ChatGPT, every response is backed by records from the CMC thesis archive.
- **Backend:** Flask
- **Database:** SQLite (`theses2.db`)
- **Embeddings:** SentenceTransformers (`all-MiniLM-L6-v2`)
- **Vector store:** ChromaDB (persistent, cosine similarity)
- **LLM:** Groq `llama-3.1-8b-instant`
- **Deployment:** Docker + AWS EC2
- **CI/CD:** GitHub Actions
```
classify()  →  fetch()   →  respond()
  (LLM)        (no LLM)       (LLM)
```
**classify()** (LLM) extracts:
- intent (title lookup, topic search, aggregation, person lookup)
- entities (names, topics, filters)
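A minimal sketch of what the classify step hands downstream. The real `classify()` calls the Groq LLM; here we only show parsing the JSON the prompt asks the model to return (the function name, fallback intent, and example reply are hypothetical):

```python
import json

def parse_classification(llm_output: str) -> dict:
    """Parse the LLM's JSON reply into an intent + entities dict."""
    data = json.loads(llm_output)
    return {
        "intent": data.get("intent", "topic_search"),  # hypothetical fallback
        "entities": data.get("entities", {}),
    }

# Example reply the classification prompt might produce
reply = '{"intent": "person_lookup", "entities": {"advisor": "Jane Doe"}}'
result = parse_classification(reply)
```

Keeping classification output as a small, fixed-shape dict is what lets the fetch stage dispatch to SQL, vector, or hybrid retrieval without any further LLM calls.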
**fetch()** (no LLM) — pure retrieval:
- SQL for structured queries
- Vector search (ChromaDB) for semantic topic matching
- Hybrid (SQL filter → vector re-rank) for constrained topic queries
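A stdlib-only sketch of the hybrid path. The real system uses ChromaDB with 384-dimensional MiniLM embeddings; the in-memory table, 2-d vectors, and thesis rows below are made up for illustration:

```python
import math
import sqlite3

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

# Toy stand-in for theses2.db
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE theses (id INTEGER, title TEXT, department TEXT)")
conn.executemany("INSERT INTO theses VALUES (?, ?, ?)", [
    (1, "Minimum Wage Effects", "Economics"),
    (2, "Neural Style Transfer", "Computer Science"),
    (3, "Trade Policy and Growth", "Economics"),
])

# Hypothetical embeddings keyed by thesis id (real ones live in ChromaDB)
embeddings = {1: [0.9, 0.1], 2: [0.1, 0.9], 3: [0.7, 0.3]}
query_vec = [1.0, 0.0]  # pretend this embeds "labor economics"

# Step 1: SQL filter on the structured constraint
rows = conn.execute(
    "SELECT id, title FROM theses WHERE department = ?", ("Economics",)
).fetchall()

# Step 2: re-rank the filtered rows by cosine similarity to the query
ranked = sorted(rows, key=lambda r: cosine(embeddings[r[0]], query_vec), reverse=True)
```

Filtering in SQL first keeps the semantic search confined to records that actually satisfy the structured constraint, so the re-rank can never surface an out-of-scope thesis.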
**respond()** (LLM) formats only the retrieved records into a grounded answer and is explicitly instructed not to use outside knowledge.
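A sketch of how the respond stage might assemble its grounded prompt. The exact wording and record fields are assumptions, not the system's actual prompt:

```python
def build_grounded_prompt(question: str, records: list[dict]) -> str:
    """Format retrieved records into a prompt that forbids outside knowledge."""
    context = "\n".join(
        f"- {r['title']} ({r['year']}), advisor: {r['advisor']}" for r in records
    )
    return (
        "Answer using ONLY the thesis records below. "
        "If the records do not contain the answer, say so.\n\n"
        f"Records:\n{context}\n\nQuestion: {question}"
    )

records = [{"title": "Minimum Wage Effects", "year": 2021, "advisor": "J. Doe"}]
prompt = build_grounded_prompt("Who advised the minimum wage thesis?", records)
```

Because the LLM only ever sees retrieved rows plus the question, any fact in the answer is traceable to a database record.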
- Topic-based advisor rankings use vector search first, then count advisors within the retrieved results
- Prompts include strict grounding instructions
- Advisor and author lookups run separate SQL queries
The model never invents advisors or theses.
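The advisor-ranking step above reduces to counting names inside real retrieval hits. A minimal sketch (the hit records here are invented; the real ones come from ChromaDB):

```python
from collections import Counter

# Hypothetical hits returned by the vector search for a topic query
hits = [
    {"title": "Thesis A", "advisor": "Prof. Smith"},
    {"title": "Thesis B", "advisor": "Prof. Lee"},
    {"title": "Thesis C", "advisor": "Prof. Smith"},
]

# Rank advisors by how many retrieved theses they supervised
ranking = Counter(h["advisor"] for h in hits).most_common()
```

Since the ranking is computed from retrieved rows rather than generated by the model, an advisor can only appear if they actually supervised a matching thesis.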
Source: Scholarship@Claremont
Stored in SQLite with fields including:
- Title
- Author(s)
- Advisor(s)
- Department(s)
- Abstract
- Keywords
- Award
- Publication date
- Season
- URL
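The field list above suggests a schema along these lines. This is a hypothetical sketch; the actual column names and types in `theses2.db` may differ:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE theses (
    id INTEGER PRIMARY KEY,
    title TEXT,
    authors TEXT,          -- may hold multiple names
    advisors TEXT,         -- may hold multiple names
    departments TEXT,
    abstract TEXT,
    keywords TEXT,
    award TEXT,
    publication_date TEXT,
    season TEXT,
    url TEXT
)
""")
cols = [row[1] for row in conn.execute("PRAGMA table_info(theses)")]
```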
| CMC Thesis Chatbot | ChatGPT-4o |
|---|---|
| ![]() | ![]() |
```
rag_system17.py               # classify → fetch → respond
config.yaml                   # acronym expansion
theses2.db                    # SQLite database
chroma_store/                 # persistent vector index
screenshots/
Dockerfile
.github/workflows/deploy.yml
```
```shell
pip install chromadb sentence-transformers groq numpy pyyaml flask
```

Set API key:

```shell
export GROQ_API_KEY=your_key_here
```

```shell
# Build vector index (run once)
python rag_system17.py --build-index

# Start chatbot
python rag_system17.py
```

Roadmap:
- Web UI (currently CLI)
- Caching layer
- Full-text thesis embeddings
- Multi-turn conversation support