Testing framework for biomedical RAG applications via FastAPI endpoints. Generate questions from study abstracts and evaluate API performance.
- Generate test questions from biomedical abstracts CSV
- Test DugBot/BDCBot APIs via HTTP endpoints
- Performance evaluation and reporting
- RAGAS evaluation for answer quality (context recall, faithfulness, etc.)
- Question types: factual, analytical, comparative, unanswerable
```bash
pip install -r requirements.txt
```

To configure the application, copy .env-template to a new .env file and modify the appropriate variables. The .env file is loaded automatically when the program starts.
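The automatic .env loading can be pictured with a minimal sketch. This is an illustration only (the framework may use a library such as python-dotenv instead); the `load_env` helper name is an assumption:

```python
import os

def load_env(path=".env"):
    """Minimal .env loader sketch: parse KEY=VALUE lines into os.environ.

    Skips blank lines and comments; existing environment variables win
    (setdefault), mirroring typical dotenv behavior. Illustrative only.
    """
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))
```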
- Load Abstracts → CSV with study abstracts
- Generate Questions → 4 types using configurable LLM (factual, analytical, comparative, unanswerable)
- Test API → Send questions to DUGBot endpoint, track performance
- Store Results → Raw API responses with timing and status
- Compute Basic Metrics → Success rate, response time, error analysis
- Run RAGAS → Answer quality evaluation using OpenAI or Ollama (faithfulness, context recall, etc.)
- Generate Report → Combined performance + quality assessment
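The "Test API → Store Results" steps above can be sketched as a simple loop that times each request and records status and errors. The function and field names below are assumptions for illustration; the HTTP call is injected via `send_fn` so the loop logic is independent of any live DUGBot endpoint:

```python
import time

def run_api_tests(questions, send_fn):
    """Send each question via send_fn(text) -> (status_code, answer).

    Records per-question timing, status, and any error, matching the
    "raw API responses with timing and status" step. Illustrative sketch;
    the real framework's api_tester.py may differ.
    """
    results = []
    for q in questions:
        start = time.perf_counter()
        try:
            status, answer = send_fn(q["question"])
            error = None
        except Exception as exc:  # network failures, timeouts, etc.
            status, answer, error = None, None, str(exc)
        results.append({
            "id": q["id"],
            "question": q["question"],
            "status": status,
            "answer": answer,
            "error": error,
            "response_time": time.perf_counter() - start,
        })
    return results
```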
```bash
# From JSON file
python main.py generate documents.json -o questions.json -n 40

# From CSV file
python main.py generate abstracts.csv -o questions.json -n 40

# Test the API
python main.py test questions.json -o test_results -r

# Evaluate results
python main.py evaluate test_results_results.json -o evaluation
python main.py evaluate test_results_results.json --with-ragas  # enable RAGAS evaluation

# Compare multiple test runs
python main.py compare test1_results.json test2_results.json -o comparison
```

```
testing_framework/
├── main.py              # CLI interface
├── config.py            # Configuration
├── qa_generator.py      # Question generation from abstracts
├── api_tester.py        # API testing (DugBot and BDCBot)
├── evaluator.py         # Results evaluation
├── data_processor.py    # Data loading/saving
├── format_converter.py  # Dataset format conversion
├── requirements.txt     # Dependencies
└── results/             # Generated results
```
```json
[
  {
    "ID": "study_001",
    "CONTEXT": "This study examines the ..."
  },
  {
    "ID": "study_002",
    "CONTEXT": "The C4R study ..."
  }
]
```

- Question datasets in JSON format
- API test results with response times and success rates
- Evaluation reports with performance metrics
- Comparison analysis across multiple tests, useful for tracking performance over a given period
Question generation uses a configurable LLM (default: Ollama/Llama) to produce the biomedical question types listed above.
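The "Compute Basic Metrics" output listed above (success rate, response time, error analysis) can be derived from the raw result records. A minimal sketch, assuming result dictionaries with `status`, `response_time`, and optional `error` fields (field names are assumptions, not the framework's actual schema):

```python
def basic_metrics(results):
    """Compute success rate, mean response time, and an error breakdown
    from a list of raw API test results. Illustrative sketch only."""
    total = len(results)
    ok = [r for r in results if r.get("status") == 200]
    errors = {}
    for r in results:
        if r.get("status") != 200:
            key = r.get("error") or f"HTTP {r.get('status')}"
            errors[key] = errors.get(key, 0) + 1
    return {
        "total": total,
        "success_rate": len(ok) / total if total else 0.0,
        "avg_response_time": sum(r["response_time"] for r in ok) / len(ok) if ok else 0.0,
        "errors": errors,
    }
```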
Simple document list:

```json
[
  {"ID": "doc1", "CONTEXT": "The C4R studies are ..."},
  {"ID": "doc2", "CONTEXT": "The Covid studies..."}
]
```

With metadata wrapper:

```json
{
  "documents": [
    {"ID": "doc1", "CONTEXT": "The C4R studies are..."},
    {"ID": "doc2", "CONTEXT": "The Covid studies..."}
  ]
}
```

- Default: Ollama with Llama 3.1 or gemma3:12b
- Configurable: any Ollama-compatible model, subject to the available GPU resources on the Sterling cluster
- Purpose: Generate test questions from abstracts
- Option 1: OpenAI GPT-4
- Option 2: Ollama with Llama 3.1/Gemma3:12b
- Purpose: Evaluate answer quality with RAGAS metrics
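The two JSON input formats shown above (a bare document list, or a `documents` wrapper) can be handled by a single loader. A sketch, assuming the `load_documents` helper name (not necessarily what data_processor.py calls it):

```python
import json

def load_documents(path):
    """Load documents from JSON, accepting either a bare list of
    {ID, CONTEXT} records or a {"documents": [...]} metadata wrapper.
    Illustrative sketch only."""
    with open(path) as fh:
        data = json.load(fh)
    docs = data["documents"] if isinstance(data, dict) else data
    for doc in docs:
        if "ID" not in doc or "CONTEXT" not in doc:
            raise ValueError(f"Document missing ID/CONTEXT: {doc}")
    return docs
```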
OpenAI for RAGAS:

```
RAGAS_EVALUATION_LLM_PROVIDER = "openai"
RAGAS_EVALUATION_LLM_API_KEY = "your-openai-key"
```

Ollama for RAGAS:

```
RAGAS_EVALUATION_LLM_PROVIDER = "ollama"
RAGAS_EVALUATION_LLM_MODEL = "llama3.1:latest"  # or "gemma3:12b"
```
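Selecting the RAGAS evaluation backend from these environment variables can be sketched as a small resolver. The variable names come from the settings above; the `ragas_llm_config` function and the returned dictionary shape are assumptions for illustration:

```python
import os

def ragas_llm_config():
    """Resolve RAGAS evaluation LLM settings from the environment.

    Defaults to Ollama; requires an API key when the OpenAI provider is
    selected. Illustrative sketch, not the framework's actual config.py.
    """
    provider = os.environ.get("RAGAS_EVALUATION_LLM_PROVIDER", "ollama")
    if provider == "openai":
        key = os.environ.get("RAGAS_EVALUATION_LLM_API_KEY")
        if not key:
            raise ValueError("RAGAS_EVALUATION_LLM_API_KEY is required for the openai provider")
        return {"provider": "openai", "api_key": key}
    return {"provider": "ollama",
            "model": os.environ.get("RAGAS_EVALUATION_LLM_MODEL", "llama3.1:latest")}
```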