Comprehensive Chat Agent Overhaul, Testing Consolidation, and Evaluation Methodology by kargig · Pull Request #177 · kargig/divemap

kargig · 2026-03-09T21:42:47Z

Summary

This PR overhauls the agentic chat subsystem to improve its accuracy, contextual awareness, and reliability. It moves the LLM away from ambiguous "catch-all" tools towards specific tool schemas with dedicated Python executors, and the branch achieves a 100% pass rate on the quantitative evaluations.

Additionally, this PR consolidates the fragmented chat test suite for better maintainability, resolves several critical runtime bugs, and introduces a double-blind LLM-as-a-judge evaluation methodology for future chat development.

Changes Made

🤖 Agent & Prompt Engineering Improvements

Specific Tool Schemas: Replaced generic tools (like search_certifications) with explicit, dedicated tools: compare_certifications, get_certification_path, get_dive_site_details, search_diving_trips, get_user_dive_logs, and get_reviews_and_comments.
Anti-Hallucination Guardrails: Updated the system prompt to strictly forbid the LLM from hallucinating coordinates for locations. The system now enforces reliance on the backend's deterministic Nominatim geocoding.
Enhanced Formatting: Instructed the LLM to automatically categorize lists of dive sites and explicitly include crucial metadata like max_depth, difficulty, and shore_direction.
Proactive Cross-Referencing: The LLM now proactively cross-references local Diving Centers when recommending dive sites.
Confident Fallbacks: Replaced rigid, apologetic error messages with confident fallback suggestions (e.g., pivoting to diving centers when dive sites aren't found).
Page Context Enrichment: Fixed the page context resolver to correctly map dive_site.name and inject rich physics metadata (depths, duration, serialized gas info) into the context window, enabling seamless SAC calculations on specific dive log pages.
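As a rough illustration of the move from catch-all tools to specific schemas with dedicated executors, the sketch below registers one tool and routes calls by name. The parameter names (`certification_a`, `certification_b`), the `dispatch` helper, and the placeholder executor are assumptions made for illustration; they are not taken from `tools.py`.

```python
from typing import Any, Callable, Dict

# Hypothetical JSON-schema-style definition; parameter names are illustrative.
COMPARE_CERTIFICATIONS_TOOL: Dict[str, Any] = {
    "name": "compare_certifications",
    "description": "Compare two diving certifications side by side.",
    "parameters": {
        "type": "object",
        "properties": {
            "certification_a": {"type": "string"},
            "certification_b": {"type": "string"},
        },
        "required": ["certification_a", "certification_b"],
    },
}

TOOL_SCHEMAS = {COMPARE_CERTIFICATIONS_TOOL["name"]: COMPARE_CERTIFICATIONS_TOOL}


def execute_compare_certifications(args: Dict[str, Any]) -> Dict[str, Any]:
    """Placeholder executor; a real one would query the database."""
    return {"compared": [args["certification_a"], args["certification_b"]]}


# One executor per tool name instead of a single conditional block.
EXECUTORS: Dict[str, Callable[[Dict[str, Any]], Dict[str, Any]]] = {
    "compare_certifications": execute_compare_certifications,
}


def dispatch(tool_name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    """Check required arguments against the schema, then run the executor."""
    schema = TOOL_SCHEMAS.get(tool_name)
    if schema is None:
        raise ValueError(f"Unknown tool: {tool_name}")
    missing = [k for k in schema["parameters"]["required"] if k not in args]
    if missing:
        raise ValueError(f"Missing arguments for {tool_name}: {missing}")
    return EXECUTORS[tool_name](args)
```

With a narrow schema like this, a malformed call fails at the dispatch boundary rather than inside backend parsing logic.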

🗺️ Dynamic Geocoding & Spatial Search

Dynamic Search Radius: Replaced the hardcoded 100km fallback radius with a dynamic calculation using the Haversine formula (calculate_distance in geo_utils.py). The search radius now scales proportionally to the size of the Nominatim bounding box (clamped between 5km and 200km).
Scope Fix: Resolved an UnboundLocalError in discovery.py by moving get_empirical_region_bounds and get_external_region_bounds imports to the module level.
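The dynamic radius described above can be sketched as follows. The Haversine formula itself is standard; the function signatures, the choice of half the bounding-box diagonal as the radius, and the `dynamic_radius_km` name are assumptions, since the PR text only states that the radius scales with the bounding box and is clamped between 5km and 200km.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def calculate_distance(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres using the Haversine formula."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def dynamic_radius_km(south: float, north: float, west: float, east: float,
                      min_km: float = 5.0, max_km: float = 200.0) -> float:
    """Assumed heuristic: half the bounding-box diagonal, clamped to [min_km, max_km]."""
    diagonal = calculate_distance(south, west, north, east)
    return max(min_km, min(max_km, diagonal / 2))
```

Under this heuristic a small harbour town would receive the 5km floor, while a country-sized bounding box would be capped at 200km.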

🎯 Intent Extraction Refinements

Dedicated Executors: Shifted from a massive monolithic others.py conditional block to dedicated executor modules (user_data.py and reviews.py), respecting global privacy settings like disable_diving_center_reviews.
Gear Rental (SearchGearRentalTool): Created a dedicated tool schema. Updated the fallback logic to explicitly join(GearRentalCost) to only recommend centers verified to offer rentals.
Career Path (CAREER_PATH): Implemented regex-based tokenization and stop-word filtering to accurately extract specific certification entities.
Comparisons (COMPARISON): Improved the sorting heuristic to prioritize exact whole-word matches over partial substring matches. Increased the result cap from 10 to 20 to prevent real database seed data from crowding out exact matches.
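A minimal sketch of the two techniques named above, regex tokenization with stop-word filtering and whole-word-first ranking, might look like this. The stop-word list, the function names, and the sample strings are hypothetical; only the 20-item cap comes from the description.

```python
import re
from typing import List

# Hypothetical stop-word list; the real one is not shown in the PR text.
STOP_WORDS = {"how", "do", "i", "get", "from", "to", "the", "a", "an",
              "and", "what", "is", "path", "certification"}


def extract_tokens(query: str) -> List[str]:
    """Lowercase, split on non-alphanumerics, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def rank_matches(term: str, names: List[str], limit: int = 20) -> List[str]:
    """Keep substring matches, but list whole-word matches first."""
    whole_word = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    candidates = [n for n in names if term.lower() in n.lower()]
    # sorted() is stable and False sorts before True, so whole-word
    # matches come first while the original order is otherwise preserved.
    return sorted(candidates, key=lambda n: whole_word.search(n) is None)[:limit]


print(extract_tokens("How do I get from Open Water to Divemaster?"))
# ['open', 'water', 'divemaster']
print(rank_matches("diver", ["Rescue Divers Club", "Advanced Diver"]))
# ['Advanced Diver', 'Rescue Divers Club']
```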

🐛 Bug Fixes & Reliability

Fixed a shadowing bug in the discovery executor where the date argument conflicted with the imported date class, crashing trip searches.
Resolved a missing argument in the weather enrichment pipeline by correctly passing intent_location to the enricher function.
Updated the ChatIntermediateAction schema to capture tool_name and the raw tool_result, enabling a high-fidelity audit trail.
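The shadowing bug is easy to reproduce in isolation. The sketch below shows one way such a conflict can arise and one possible fix (a qualified module import); the function names are hypothetical and the actual code in the discovery executor may differ.

```python
import datetime as dt
from datetime import date


def resolve_start_buggy(date=None):
    # The parameter `date` shadows the imported class inside this function,
    # so both calls below operate on the argument, not on datetime.date.
    if date is None:
        return date.today()          # AttributeError on None
    return date.fromisoformat(date)  # AttributeError on str


def resolve_start_fixed(date=None):
    # Referring to the class through the module avoids the name clash
    # without changing the public parameter name.
    if date is None:
        return dt.date.today()
    return dt.date.fromisoformat(date)
```

Renaming the parameter would also work, but keeping the name and qualifying the class leaves the tool schema unchanged.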

🧪 Testing & Evaluation Methodology

Suite Reorganization: Merged 11 scattered and overlapping test files into 3 logically organized files (test_chat_agent.py, test_chat_executors.py, test_chat_api.py) that align with the system's modular architecture.
Fixture Fixes: Resolved data-dependency bugs in recommendation fixtures to ensure reliable test execution.
LLM-as-a-Judge Evaluation Pipeline: Introduced analyze_chat_quality_diff.py and evaluate_qualitative.py scripts to automate double-blind A/B testing of chatbot responses.
Documentation: Added comprehensive Markdown documentation in docs/development/chat_evaluation_methodology.md establishing the standard operating procedure for running the new evaluation pipeline.
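To make the blinding idea concrete, here is a minimal sketch of an A/B harness in which the judge only ever sees responses labelled A and B, with the assignment randomized per prompt. It is not the logic of `evaluate_qualitative.py`; the function names, the case format, and the length-based stand-in judge are assumptions, and a real judge would call an LLM.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Tuple


def blind_pair(old: str, new: str, rng: random.Random) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Randomly assign the two responses to labels A and B.

    Returns the labelled responses and the hidden key mapping labels to branches.
    """
    if rng.random() < 0.5:
        return {"A": old, "B": new}, {"A": "old", "B": "new"}
    return {"A": new, "B": old}, {"A": "new", "B": "old"}


def run_ab_evaluation(cases: List[Dict[str, str]],
                      judge: Callable[[str, str, str], str],
                      seed: int = 0) -> Counter:
    """Tally wins per branch; the judge returns "A" or "B" and never sees the key."""
    rng = random.Random(seed)
    tally: Counter = Counter()
    for case in cases:
        labelled, key = blind_pair(case["old"], case["new"], rng)
        verdict = judge(case["prompt"], labelled["A"], labelled["B"])
        tally[key[verdict]] += 1
    return tally


def longer_is_better(prompt: str, a: str, b: str) -> str:
    """Stand-in judge for demonstration only."""
    return "A" if len(a) >= len(b) else "B"
```

Randomizing the label assignment per prompt guards against a judge that systematically favours whichever response appears first.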

Testing

Automated Tests: Ran the full backend test suite (./docker-test-github-actions.sh). All tests pass (1430/1430). The newly consolidated test files correctly handle DB seed data overlaps.
Qualitative LLM Evaluation: Ran a comprehensive, double-blind A/B evaluation against 38 real-world prompts comparing the old architecture to this branch. This branch achieved a 100% quantitative pass rate and won the qualitative assessment 28-10, fixing previous regressions in Gear Rental, Certifications, and Regional Searches.
Manual Verification: Manually tested complex prompts via a local testing script to verify new tools (like get_user_dive_logs and get_reviews_and_comments) correctly invoked backend logic and respected privacy constraints.

Related Issues

Resolves intent parsing regressions and geocoding hallucination issues identified during the chat architecture migration.

Additional Notes

Reviewers: Pay special attention to the new specialized tools in tools.py and the dynamic Haversine radius calculation in discovery.py.
Deployment: No database migrations are required. The backend environment must have access to outbound internet for Nominatim API calls (already standard for geo_utils), and DEEPSEEK_API_KEY must be configured to run the new evaluation methodology scripts.

Enhance the agentic chat subsystem by refining the tool execution history metadata and resolving critical bugs identified during quality evaluation.

Key Improvements:
- Update ChatIntermediateAction schema to capture tool_name and raw tool_result, enabling a high-fidelity audit trail for AI reasoning.
- Fix shadowing bug in discovery executor where the 'date' argument conflicted with the 'date' class import, crashing trip searches.
- Resolve missing argument in weather enrichment pipeline by correctly passing intent_location to the enricher function.
- Prevent fuzzy site resolution from blocking subsequent text filters by preserving the location parameter after coordinate resolution.
- Enforce sensible PPO2 defaults in tool schemas and system prompts to prevent unnecessary LLM clarification loops.

Testing & Reliability:
- Introduce test_chat_agent_integration.py to verify the wiring between LLM tool calls and Python backend logic.
- Introduce test_chat_agent_comprehensive.py to validate complex edge cases including fuzzy name resolution and physics calculations.
- Update ENTITY_ICONS imports in base executor to resolve NameErrors.
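An audit-trail record along these lines could be modelled as below. Only `tool_name` and `tool_result` are named in the commit message; the `arguments` field, the `record_action` helper, and the use of a dataclass as a stand-in for the real schema are assumptions.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ChatIntermediateActionSketch:
    """Stand-in for the real schema; only the first two fields are from the PR."""
    tool_name: str
    tool_result: Any
    arguments: Dict[str, Any] = field(default_factory=dict)  # hypothetical field


def record_action(trail: List[Dict[str, Any]], tool_name: str,
                  arguments: Dict[str, Any], tool_result: Any) -> None:
    """Append one tool invocation, including its raw result, to the trail."""
    action = ChatIntermediateActionSketch(
        tool_name=tool_name, tool_result=tool_result, arguments=arguments
    )
    trail.append(asdict(action))


trail: List[Dict[str, Any]] = []
record_action(trail, "search_diving_trips", {"location": "Attica"}, {"trips": []})
```

Storing the raw result next to the tool name lets a reviewer check whether a final answer is actually supported by what the tool returned.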

Overhaul the chat testing architecture to reduce fragmentation and improve maintainability. Merged 11 scattered test files into 3 logically organized primary files that align with the system's modular design.

Key Changes:
- Create test_chat_agent.py: Focused on the ReAct loop, tool calling logic, context resolution, and fuzzy location name mapping.
- Create test_chat_executors.py: Focused on backend capability logic including spatial bounding boxes, directions, ratings, and physics.
- Update test_chat_api.py: Maintained as the high-level REST endpoint and session management validation suite.
- Remove 9 redundant and overlapping test files to eliminate clutter.
- Fix data-dependency bugs in recommendation fixtures to ensure reliable test execution in isolated environments.

This reorganization provides a clear map for future test development and ensures 70%+ coverage on critical chat service components.

- Add `calculate_distance` using Haversine formula to compute dynamic search radius based on bounding box size, replacing the hardcoded 100km fallback.
- Update system prompt in `chat_service.py` to prevent LLM coordinate hallucinations for regions/cities, enforcing reliance on Nominatim.
- Resolve `UnboundLocalError` in `discovery.py` by promoting geocoding imports to the module level.
- Introduce `SearchGearRentalTool` to handle specific gear rental intents and filter fallback diving centers strictly by `GearRentalCost` existence.
- Refine `CAREER_PATH` execution with regex tokenization and stop words to accurately extract certification entities.
- Enhance `COMPARISON` intent sorting to prioritize exact word matches, resolving overlapping mock data issues in `test_comparison_logic`.
- Increase data limits for discovery and comparison intents to provide the LLM with a denser context window.
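The `UnboundLocalError` class of bug can be reproduced with standard-library names. In the sketch below, `json.dumps` stands in for the geocoding helpers; the function names are hypothetical and the real code path in `discovery.py` is not shown in this PR text.

```python
from json import dumps  # module-level import, visible to every function


def serialize_buggy(data, pretty=False):
    if pretty:
        # An import inside a function binds the name as a local variable
        # for the WHOLE function body, not just this branch.
        from json import dumps
        return dumps(data, indent=2)
    # When pretty is False the local `dumps` was never assigned, so this
    # raises UnboundLocalError even though the module-level name exists.
    return dumps(data)


def serialize_fixed(data, pretty=False):
    # With the import only at module level, both branches resolve the
    # same global name.
    if pretty:
        return dumps(data, indent=2)
    return dumps(data)
```

Promoting the import to module level removes the conditional local binding, which matches the remedy described in the commit message.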

Remove the generic `search_certifications` tool in favor of highly specific schemas (`compare_certifications`, `get_certification_path`, `get_dive_site_details`, and `search_diving_trips`) to eliminate LLM confusion and regex parsing in the backend. Add `get_user_dive_logs` and `get_reviews_and_comments` tools, routing them to new, dedicated executor modules (`user_data.py` and `reviews.py`). This allows the LLM to analyze personal logbooks and community feedback while strictly respecting the global `disable_diving_center_reviews` privacy setting. Fix the page context resolver to correctly map `dive_site.name` and inject rich physics metadata (depths, duration, serialized gas info) into the context window so the LLM can seamlessly perform SAC calculations on specific dive logs.
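For context on why depths, duration, and gas info matter, the sketch below computes a surface air consumption (SAC) rate from those values using the common approximation of one additional atmosphere per 10 m of depth. The function name, parameter names, and sample numbers are illustrative and are not taken from the codebase.

```python
def sac_rate_l_per_min(start_bar: float, end_bar: float, cylinder_litres: float,
                       duration_min: float, avg_depth_m: float) -> float:
    """Surface-equivalent gas consumption in litres per minute."""
    if duration_min <= 0:
        raise ValueError("duration_min must be positive")
    # Gas consumed, expressed as litres at surface pressure.
    gas_used_litres = (start_bar - end_bar) * cylinder_litres
    # Approximate absolute pressure at the average depth (1 ata per 10 m).
    ambient_ata = avg_depth_m / 10.0 + 1.0
    return gas_used_litres / duration_min / ambient_ata


# 150 bar used from a 12 L cylinder over 45 minutes at 15 m average depth.
print(sac_rate_l_per_min(200, 50, 12, 45, 15))  # 16.0
```

Every input here corresponds to a field mentioned in the context enrichment, which is why the model cannot produce this figure if any of them is missing from the context window.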

Add `analyze_chat_quality_diff.py` and `evaluate_qualitative.py` scripts to automate double-blind A/B testing of chatbot responses using an LLM as a judge, ensuring quantitative and qualitative regressions are caught. Update `evaluate_chat_quality.py` to fix typo 'Athens' -> 'Attica' in test prompt for gear rental validation. Add comprehensive Markdown documentation in `docs/development/chat_evaluation_methodology.md` establishing the standard operating procedure for running the new evaluation pipeline.

kargig added 4 commits March 8, 2026 20:50

kargig changed the title ~~Fix chat intent parsing, dynamic geocoding, and consolidate testing~~ Comprehensive Chat Agent Overhaul, Testing Consolidation, and Evaluation Methodology Mar 9, 2026

kargig force-pushed the fix/improve_chat_tests branch 2 times, most recently from b6eb317 to d16a648 March 10, 2026 07:51

kargig force-pushed the main branch from 6f2083b to e9242c2 March 10, 2026 08:16

kargig merged commit 7e8e4c2 into main Mar 10, 2026
1 check passed

kargig merged 5 commits into main from fix/improve_chat_tests
