This GitHub repo contains a senior thesis for Claremont McKenna College (CMC); it is a thesis about theses. The project is an interactive, RAG-based chatbot that helps students, researchers, and faculty explore CMC senior theses. The goal was to build a domain-specific chatbot to show that common limitations of general-purpose AI, including hallucinations, outdated data, and lack of domain expertise, can be mitigated. The website link is:
The system combines the llama3-8b-8192 model, served by Groq, with a structured SQLite database to provide thesis-specific answers. It supports search, analysis, and brainstorming based on metadata from CMC’s thesis archive (through Fall 2024).
Data Collection
- CSVs sourced from Scholarship@Claremont
- Cleaned, merged, and indexed into `theses.db` with FTS5
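
The indexing step above can be sketched as follows. This is a minimal illustration, not the project's actual loader: the column names, the sample rows, and the helper names `build_index` and `search` are all assumptions, and it relies on the local SQLite build including the FTS5 extension (most Python distributions do).

```python
import csv
import io
import sqlite3

# Hypothetical sample standing in for a Scholarship@Claremont CSV export.
SAMPLE_CSV = """title,author,department,year
Minimum Wage Effects in California,Jane Doe,Economics,2020
Neural Networks for Protein Folding,John Roe,Computer Science,2021
"""

def build_index(conn, csv_text):
    """Load CSV rows into an FTS5 virtual table (assumed schema)."""
    conn.execute(
        "CREATE VIRTUAL TABLE IF NOT EXISTS theses "
        "USING fts5(title, author, department, year UNINDEXED)"
    )
    rows = csv.DictReader(io.StringIO(csv_text))
    conn.executemany(
        "INSERT INTO theses (title, author, department, year) "
        "VALUES (:title, :author, :department, :year)",
        # Light cleaning: strip stray whitespace from every field.
        ({key: value.strip() for key, value in row.items()} for row in rows),
    )
    conn.commit()

def search(conn, query, limit=5):
    """Return (title, author) pairs ordered by FTS5 relevance."""
    cursor = conn.execute(
        "SELECT title, author FROM theses WHERE theses MATCH ? "
        "ORDER BY rank LIMIT ?",
        (query, limit),
    )
    return cursor.fetchall()

conn = sqlite3.connect(":memory:")  # the project stores this in theses.db
build_index(conn, SAMPLE_CSV)
results = search(conn, "wage")
```

Here `year` is marked `UNINDEXED` so it is stored but not tokenized; whether the real schema does this is not stated in this README.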
Architecture (RAG)
- Retrieve: Matches user query to thesis data (title, author, department, etc.)
- Augment: Adds metadata to the prompt
- Generate: Groq LLM crafts accurate, natural responses
- Format: Returns structured, markdown-friendly output
- (Specifically: the LLM reads the prompt and generates SQL queries; the queries are executed against the database, and the LLM formulates the final answer from the results.)
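
The retrieve, augment, generate flow above might look roughly like the sketch below. It is only an illustration under stated assumptions: `answer` and `fake_llm` are hypothetical names, the model is replaced by a stub so the example runs offline (the real project calls Groq), and the table schema and prompts are invented for the demo.

```python
import sqlite3

def answer(question, conn, llm):
    """Retrieve -> augment -> generate; `llm` is any str -> str callable."""
    # Retrieve: the model proposes a SQL query, and the application runs it.
    sql = llm(f"Write one SQLite SELECT statement answering: {question}").strip()
    if not sql.lower().startswith("select"):
        raise ValueError("refusing to run a non-SELECT statement")
    rows = conn.execute(sql).fetchall()
    # Augment: place the retrieved rows into the prompt as context.
    context = "\n".join(" | ".join(str(value) for value in row) for row in rows)
    prompt = f"Thesis records:\n{context}\n\nQuestion: {question}\nAnswer in Markdown."
    # Generate: the model writes the final answer from the retrieved rows.
    return llm(prompt)

def fake_llm(prompt):
    """Offline stand-in for the hosted model (an assumption for this demo)."""
    if prompt.startswith("Write one SQLite"):
        return "SELECT title, advisor FROM theses WHERE department = 'Economics'"
    # Echo the first retrieved record so the flow is visible.
    return "Found: " + prompt.split("\n")[1]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE theses (title TEXT, advisor TEXT, department TEXT)")
conn.execute(
    "INSERT INTO theses VALUES (?, ?, ?)",
    ("Minimum Wage Effects in California", "A. Advisor", "Economics"),
)
reply = answer("Who advises Economics theses?", conn, fake_llm)
```

The `SELECT`-only check is a simple guard added for the sketch; a production system would likely also open the database read-only.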
Deployment
- Flask backend + Docker
- CI/CD with GitHub Actions
- Deployed on AWS EC2 with secure HTTPS (Route 53 + ACM)
Features
- Search by title, author, advisor, department, or keyword
- Run analytical queries (e.g., "How many Economics theses in 2020?")
- Get thesis ideas with outlines and advisor suggestions
- Summarize or explain thesis abstracts
- Explore advisor expertise
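
As an illustration of the analytical queries listed above, a question such as "How many Economics theses in 2020?" could reduce to a parameterized aggregate like the one below. The `theses` schema, the sample rows, and the `count_theses` helper are assumptions made for this sketch, not the project's actual code.

```python
import sqlite3

def count_theses(conn, department, year):
    """Count theses for one department and year using bound parameters."""
    row = conn.execute(
        "SELECT COUNT(*) FROM theses WHERE department = ? AND year = ?",
        (department, year),
    ).fetchone()
    return row[0]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE theses (title TEXT, department TEXT, year INTEGER)")
conn.executemany(
    "INSERT INTO theses VALUES (?, ?, ?)",
    [
        ("Thesis A", "Economics", 2020),  # hypothetical rows
        ("Thesis B", "Economics", 2020),
        ("Thesis C", "Economics", 2021),
    ],
)
economics_2020 = count_theses(conn, "Economics", 2020)
```

Binding the department and year as parameters, rather than formatting them into the SQL string, keeps user-supplied text from being interpreted as SQL.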
| CMC Thesis Chatbot | ChatGPT-4o |
|---|---|
| Generates ideas based on real thesis metadata and advisor matching. | Ideas not grounded in CMC thesis data, risks repetition and lacks specific advisor context. |
| Finds actual advisors from the thesis database based on query context. | Returns unrelated advisor info not present in the thesis metadata. |
| Searches and summarizes from actual theses in the CMC database. | Can summarize papers but lacks access to full CMC thesis archive. |

Future Improvements
- Improve query handling for spelling correction and edge cases
- Expand metadata coverage (add full thesis text?)
- Responsive UI design for mobile devices
- User authentication for session tracking
- Code optimization
Special thanks to:
- My family (mom, grandma, grandpa, sister) for their unwavering support
- Professor Mike Izbicki for guidance and mentorship
- The CMC community for fostering an environment of academic excellence and innovation
- Claremont Colleges Library for providing access to thesis repository data






