A framework for evaluating LLM responses in your applications using Java Selenium tests with a Python evaluation service powered by Google Gemini.
- 7 Evaluation Metrics: Accuracy, Relevancy, Coherence, Hallucination, Faithfulness, Compliance, Toxicity
- Web Testing Console: Beautiful dark-themed UI for manual testing
- Gemini Powered: Uses Google's Gemini AI for intelligent evaluation
- Java + Python Bridge: Seamless integration between Selenium tests and Python evaluation
- Custom Compliance Rules: Define must-contain/must-not-contain terms
- Detailed Feedback: Get scores and explanations for each metric
```
┌──────────────────────┐      HTTP POST      ┌──────────────────────┐
│   Java Selenium      │ ───────────────────▶│   Python FastAPI     │
│   Test Suite         │                     │  Evaluation Service  │
│                      │ ◀───────────────────│                      │
│  • UI Automation     │    JSON Response    │  • Gemini AI         │
│  • LLM Response      │                     │  • Score & Reason    │
│    Capture           │                     │  • Pass/Fail         │
└──────────────────────┘                     └──────────────────────┘
```
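Because the bridge is plain HTTP with JSON bodies, any client can drive the evaluation service. The following is a minimal sketch, assuming the service is running locally on port 8000 and that the `requests` package is installed (it is not one of the service's own dependencies); the field names follow the `/evaluate` request and response documented later in this README.

```python
import requests

# Minimal sketch: post a captured LLM response to the evaluation service.
payload = {
    "llm_response": "Our support desk is open 9 AM to 5 PM, Monday to Friday.",
    "user_query": "What are your business hours?",
    "expected_output": "9 AM to 5 PM",
    "metrics": ["accuracy", "relevancy"],
}

resp = requests.post("http://localhost:8000/evaluate", json=payload, timeout=60)
result = resp.json()

print(result["passed"], result["overall_score"])
for name, metric in result["metrics"].items():
    print(f"{name}: {metric['score']} ({metric['reason']})")
```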
To set up and start the Python evaluation service:

```bash
git clone https://github.com/khaled-yousef-TV/selenium-deepEval-JavaPython.git
cd selenium-deepEval-JavaPython
cd python_deepeval_service

# Create virtual environment
python3 -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

# Install dependencies
pip install fastapi uvicorn pydantic google-generativeai python-dotenv

# Configure your Gemini API key
echo "GEMINI_API_KEY=your-api-key-here" > .env
```

Get a free Gemini API key at: https://aistudio.google.com/apikey

```bash
python -m uvicorn app.main:app --reload --port 8000
```

Navigate to: http://localhost:8000
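Once uvicorn is running, a quick smoke test against the `/health` endpoint (documented in the API reference below) confirms the service is reachable; this sketch again assumes the `requests` package is available.

```python
import requests

# The service should respond on /health once uvicorn is up.
resp = requests.get("http://localhost:8000/health", timeout=10)
print(resp.status_code, resp.text)
```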
| Metric | Description | Use Case |
|---|---|---|
| Accuracy | How correct is the response? | Fact-checking |
| Relevancy | Does it answer the question? | Q&A validation |
| Coherence | Is it well-structured? | Quality assurance |
| Hallucination | Does it make things up? | Fact verification |
| Faithfulness | Is it faithful to context? | RAG evaluation |
| Compliance | Does it meet custom rules? | Policy enforcement |
| Toxicity | Is it safe & appropriate? | Content moderation |
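Metrics are selected per request, so each test can combine the ones that fit its scenario. As an illustrative sketch (the response, query, and context values are made up), a RAG-style check might pair faithfulness and hallucination with the retrieved context:

```python
# Illustrative metric selection for a RAG-style check: faithfulness and
# hallucination both judge the response against the supplied context.
rag_payload = {
    "llm_response": "The warranty covers parts and labour for two years.",
    "user_query": "What does the warranty cover?",
    "context": "Warranty terms: parts and labour are covered for 24 months.",
    "metrics": ["faithfulness", "hallucination", "relevancy"],
}
```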
`GET /health`

`POST /evaluate`

Request (`Content-Type: application/json`):

```json
{
  "llm_response": "Paris is the capital of France.",
  "expected_output": "Paris",
  "user_query": "What is the capital of France?",
  "context": "France is a country in Europe. Its capital is Paris.",
  "metrics": ["accuracy", "relevancy", "hallucination"],
  "custom_criteria": {
    "must_contain": ["Paris"],
    "must_not_contain": ["guaranteed"]
  }
}
```

Response:

```json
{
  "success": true,
  "passed": true,
  "overall_score": 0.95,
  "metrics": {
    "accuracy": {
      "name": "accuracy",
      "score": 1.0,
      "threshold": 0.7,
      "passed": true,
      "reason": "The response is perfectly accurate..."
    }
  }
}
```

`POST /evaluate/quick?llm_response=...&expected_output=...&min_score=0.7`

`POST /evaluate/hallucination`

```json
{
  "llm_response": "...",
  "context": "..."
}
```
Gradle dependencies for the Java Selenium module:

```groovy
dependencies {
implementation 'org.seleniumhq.selenium:selenium-java:4.+'
implementation 'com.squareup.okhttp3:okhttp:4.12.0'
implementation 'com.fasterxml.jackson.core:jackson-databind:2.15.2'
testImplementation 'org.testng:testng:7.+'
}
```

```java
public class LLMQualityTest {

    private static final String EVAL_URL = "http://127.0.0.1:8000/evaluate";

    // driver (WebDriver), httpClient (OkHttpClient), objectMapper (Jackson) and
    // EvalResult (a POJO mirroring the /evaluate response) are set up elsewhere
    // in the test class.

    @Test
    public void testChatbotResponseQuality() throws Exception {
        // 1. Selenium captures LLM response from your app
        String llmResponse = driver.findElement(By.id("chatResponse")).getText();

        // 2. Send to evaluation service
        String json = """
            {
              "llm_response": "%s",
              "user_query": "What are your business hours?",
              "expected_output": "9 AM to 5 PM",
              "metrics": ["accuracy", "relevancy"]
            }
            """.formatted(llmResponse);

        Request request = new Request.Builder()
                .url(EVAL_URL)
                .post(RequestBody.create(json, MediaType.parse("application/json")))
                .build();

        // 3. Assert on evaluation results
        Response response = httpClient.newCall(request).execute();
        EvalResult result = objectMapper.readValue(response.body().string(), EvalResult.class);

        assertTrue(result.passed, "LLM response quality check failed");
        assertTrue(result.overall_score > 0.7, "Score below threshold");
    }
}
```
```
.
├── python_deepeval_service/
│   ├── app/
│   │   ├── main.py          # FastAPI endpoints
│   │   ├── models.py        # Pydantic models
│   │   └── evaluator.py     # Gemini evaluation logic
│   ├── static/
│   │   └── index.html       # Testing Console UI
│   ├── requirements.txt
│   └── .env                 # API key (git-ignored)
│
├── java_selenium_module/
│   ├── src/test/java/       # Selenium tests
│   └── build.gradle
│
├── .gitignore
└── README.md
```
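For orientation, the heart of `app/evaluator.py` is a Gemini call per metric that returns a score with a reason. The sketch below is only an assumption about how that might look: the prompt wording, the `gemini-1.5-flash` model name, and the `score_metric` helper are illustrative, not the repository's actual implementation.

```python
import json
import google.generativeai as genai

genai.configure(api_key="your-api-key-here")  # normally read from .env
model = genai.GenerativeModel("gemini-1.5-flash")  # assumed model name

def score_metric(metric: str, llm_response: str, user_query: str,
                 threshold: float = 0.7) -> dict:
    """Hypothetical helper: ask Gemini to grade one metric, return score + reason."""
    prompt = (
        f"Evaluate the following answer for {metric} on a 0.0-1.0 scale.\n"
        f"Question: {user_query}\nAnswer: {llm_response}\n"
        'Reply as JSON: {"score": <float>, "reason": "<short explanation>"}'
    )
    raw = model.generate_content(prompt).text.strip()
    if raw.startswith("```"):  # strip a markdown fence if the model adds one
        raw = raw.strip("`").removeprefix("json").strip()
    graded = json.loads(raw)
    return {
        "name": metric,
        "score": graded["score"],
        "threshold": threshold,
        "passed": graded["score"] >= threshold,
        "reason": graded["reason"],
    }
```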
- API keys are stored in `.env` files (git-ignored)
- Never commit API keys to version control
- The `.gitignore` includes: `.env`, `venv/`, `__pycache__/`
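Inside the service, the key can be read from `.env` with python-dotenv (already in the dependency list above). How the repository loads it exactly may differ; this is a minimal sketch:

```python
import os
from dotenv import load_dotenv

# Reads .env from the working directory and exposes its entries as env vars.
load_dotenv()
api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set; see the Quick Start section")
```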
The web UI at http://localhost:8000 provides:
- Dark theme with gradient accents
- Input fields for response, query, expected output, context
- Metric selection checkboxes
- Custom compliance rules
- Visual score bars with pass/fail indicators
- Detailed reasoning for each metric
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
MIT License - see LICENSE for details.
Built with ❤️ by Khaled Yousef