Java-Native RAG System

A full-stack Retrieval-Augmented Generation (RAG) application for semantic search over healthcare questions and answers, built on the MedQuAD dataset.

🏗️ Architecture

Backend: Java Spring Boot 3.x with Spring AI
Vector Database: PostgreSQL with pgvector extension
Frontend: React + TypeScript with Vite
Embeddings & LLM: OpenAI (text-embedding-ada-002 & GPT-4)
Evaluation: Python RAGAS framework

📁 Project Structure

java-rag-semantic-search/
├── backend/              # Spring Boot application
│   ├── src/
│   │   └── main/
│   │       ├── java/com/vardhan/rag/
│   │       │   ├── RagApplication.java
│   │       │   ├── dto/
│   │       │   ├── service/
│   │       │   └── controller/
│   │       └── resources/
│   │           └── application.properties
│   ├── pom.xml
│   └── Dockerfile
├── frontend/             # React TypeScript app
│   ├── src/
│   │   ├── components/
│   │   │   ├── SearchUI.tsx
│   │   │   └── BenchmarkDashboard.tsx
│   │   ├── App.tsx
│   │   └── main.tsx
│   ├── package.json
│   ├── Dockerfile
│   └── nginx.conf
├── data/                 # Dataset files
│   ├── data_prep.py
│   └── README.md
├── research/             # Evaluation scripts
│   ├── evaluate.py
│   ├── eval_dataset.csv
│   ├── requirements.txt
│   └── README.md
├── docker-compose.yml
└── README.md

🚀 Quick Start

Prerequisites

Java 21+
Node.js 20+
Docker & Docker Compose
Python 3.10+
OpenAI API Key

1. Clone and Set Up

git clone https://github.com/grammerpro/Java-Native-RAG-System.git
cd Java-Native-RAG-System

2. Set Environment Variables

Create a .env file in the root directory (the repository includes a .env.template you can copy as a starting point):

OPENAI_API_KEY=your-openai-api-key-here

3. Prepare Data

Download the MedQuAD dataset from Kaggle (link in the Dataset section), place it in the data/ directory, then run the preparation script:

cd data
pip install pandas
python data_prep.py
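
The contents of data_prep.py are not shown in this README. As a rough sketch of what a preparation step of this kind could look like, the snippet below turns CSV text into a list of question/answer records ready to be written out as JSON. The column names `question` and `answer` and the helper `rows_to_records` are assumptions made for illustration; check the downloaded file's header and the actual script before relying on them.

```python
import csv
import io
import json


def rows_to_records(csv_text: str) -> list[dict]:
    """Convert CSV text with `question` and `answer` columns (assumed names)
    into cleaned records, skipping rows where either field is empty."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        question = (row.get("question") or "").strip()
        answer = (row.get("answer") or "").strip()
        if question and answer:
            records.append({"question": question, "answer": answer})
    return records


# Tiny inline sample standing in for the real dataset file.
sample = "question,answer\nWhat is glaucoma?,An eye disease.\nEmpty row?,\n"
records = rows_to_records(sample)
print(json.dumps(records, indent=2))
```

Dropping incomplete rows before ingestion keeps empty documents out of the vector store; whether the real script does this is not stated here.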

4. Start with Docker

docker-compose up --build

This will start:

PostgreSQL with pgvector on port 5432
Spring Boot backend on port 8080
React frontend on port 3000

5. Access the Application

Frontend: http://localhost:3000
Backend API: http://localhost:8080/api
Health Check: http://localhost:8080/api/rag/health
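
To confirm from a script that the backend is reachable, something like the following could be used. It is a minimal sketch that only checks for an HTTP 200 on the health URL listed above; the helper name `backend_is_up` is hypothetical, and the response body format is not assumed.

```python
import urllib.error
import urllib.request

HEALTH_URL = "http://localhost:8080/api/rag/health"


def backend_is_up(url: str = HEALTH_URL, timeout: float = 3.0) -> bool:
    """Return True if the URL answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx status.
        return False


if __name__ == "__main__":
    print("backend up:", backend_is_up())
```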

📊 Using the Application

1. Ingest Data

First, ingest the MedQuAD data into the vector store:

Navigate to the Search UI
Click the "Ingest Data" button
Wait for the process to complete

2. Ask Questions

Once data is ingested, you can ask medical questions:

Type your question in the input field
Click "Send"
The system will retrieve relevant context and generate an answer
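
The retrieve-then-generate step can be illustrated with a small, self-contained sketch. This is not the project's Spring AI code: in the real system the embeddings come from OpenAI and the similarity search runs in pgvector. Here the three-dimensional vectors are made up by hand, and the function names are hypothetical, purely to show how the most similar documents are selected and placed into a prompt.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def retrieve(query_vec: list[float], store: list[dict], k: int = 2) -> list[dict]:
    """Return the k documents whose vectors are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda doc: cosine_similarity(query_vec, doc["vector"]),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, docs: list[dict]) -> str:
    """Place the retrieved passages ahead of the question for the LLM."""
    context = "\n".join(f"- {doc['text']}" for doc in docs)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


# Toy store with hand-made vectors (real embeddings have many more dimensions).
store = [
    {"text": "Glaucoma damages the optic nerve.", "vector": [0.9, 0.1, 0.0]},
    {"text": "Insulin regulates blood sugar.", "vector": [0.0, 0.2, 0.9]},
    {"text": "Eye pressure is a glaucoma risk factor.", "vector": [0.8, 0.3, 0.1]},
]
query_vec = [1.0, 0.2, 0.0]
top_docs = retrieve(query_vec, store, k=2)
print(build_prompt("What is glaucoma?", top_docs))
```

With these toy vectors the two glaucoma passages rank above the insulin passage, so only they reach the prompt.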

3. Run Benchmarks

To evaluate RAG performance:

Install Python dependencies:

cd research
pip install -r requirements.txt

Navigate to /benchmark in the frontend
Click "Run Benchmark"
View RAGAS metrics and visualizations

🔧 Development

Backend Development

cd backend
./mvnw spring-boot:run

Frontend Development

cd frontend
npm install
npm run dev

Python Evaluation

cd research
pip install -r requirements.txt
python evaluate.py

📈 RAGAS Metrics

The system evaluates RAG performance using:

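Faithfulness: Factual consistency of the generated answer with the retrieved context (0-1)
Answer Relevancy: How directly the answer addresses the question (0-1)
Context Relevancy: How relevant the retrieved documents are to the question (0-1)
Context Recall: How completely the retrieved documents cover the information needed for the reference answer (0-1)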
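
As a rough illustration of how a score in the 0-1 range arises, faithfulness can be thought of as the fraction of claims in the answer that the retrieved context supports. In RAGAS an LLM extracts and judges the claims; in the sketch below the verdicts are supplied by hand, and the function is a simplified stand-in rather than the library's implementation.

```python
def faithfulness(claim_supported: list[bool]) -> float:
    """Fraction of answer claims that the retrieved context supports."""
    if not claim_supported:
        raise ValueError("at least one claim is required")
    return sum(claim_supported) / len(claim_supported)


# Hand-labelled verdicts for four claims in one answer (illustrative only).
verdicts = [True, True, False, True]
score = faithfulness(verdicts)
print(score)  # 0.75
```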

🛠️ Technology Stack

Backend

Spring Boot 3.2.5
Spring AI 1.0.0-M1
PostgreSQL + pgvector
OpenAI API
Lombok

Frontend

React 18
TypeScript
Vite
Axios
Recharts
React Router

Research

RAGAS
LangChain
Pandas
Datasets

📝 API Endpoints

RAG Controller

POST /api/rag/query - Query the RAG system
POST /api/rag/ingest - Ingest data into vector store
GET /api/rag/health - Health check
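
A query request could be assembled from a script along these lines. The request body schema is not documented here, so the JSON field name `question` is an assumption; check the DTO classes in the backend for the actual shape. The helper `build_query_request` is hypothetical. The sketch only builds the request, and the commented lines show how it might be sent once the backend is running.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/api"


def build_query_request(question: str) -> urllib.request.Request:
    """Build a POST request for the query endpoint with a JSON body."""
    # The "question" field name is an assumption, not confirmed by this README.
    body = json.dumps({"question": question}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}/rag/query",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_query_request("What are the symptoms of glaucoma?")
print(request.get_method(), request.full_url)

# To actually send it (requires the backend to be running):
# with urllib.request.urlopen(request, timeout=30) as response:
#     print(response.read().decode("utf-8"))
```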

Benchmark Controller

GET /api/benchmark/run - Run RAGAS evaluation
GET /api/benchmark/status - Check benchmark service status

🐳 Docker Commands

# Start all services
docker-compose up -d

# View logs
docker-compose logs -f

# Stop all services
docker-compose down

# Rebuild and start
docker-compose up --build

# Remove volumes (reset database)
docker-compose down -v

🔍 Troubleshooting

Backend Issues

Can't connect to database: Ensure the PostgreSQL container is running
OpenAI API errors: Check that OPENAI_API_KEY is set correctly in your environment variables
Data ingestion fails: Verify that medquad.json exists in the data directory

Frontend Issues

Can't connect to backend: Check that the backend is running on port 8080
Build errors: Delete node_modules and run npm install again

Evaluation Issues

Python dependencies: Install all requirements with pip install -r research/requirements.txt
RAGAS errors: Ensure the OpenAI API key is set and valid

📚 Dataset

The MedQuAD dataset contains medical Q&A pairs from authoritative sources:

National Institutes of Health (NIH)
Centers for Disease Control and Prevention (CDC)
Food and Drug Administration (FDA)

Download from: https://www.kaggle.com/datasets/gpreda/medquad

🤝 Contributing

This is a research project. Feel free to:

Add more evaluation metrics
Experiment with different embedding models
Improve the UI/UX
Add more medical datasets

📄 License

This project is for educational and research purposes.

🙏 Acknowledgments

Spring AI team for the excellent framework
RAGAS team for the evaluation framework
MedQuAD dataset creators

Built with ❤️ for healthcare AI research
