Comprehensive Chat Agent Overhaul, Testing Consolidation, and Evaluation Methodology #177

Merged
kargig merged 5 commits into main from fix/improve_chat_tests
Mar 10, 2026

Conversation

Owner

@kargig kargig commented Mar 9, 2026

Summary

This PR upgrades the agentic chat subsystem to improve its accuracy, contextual awareness, and reliability. It moves the LLM away from ambiguous, "catch-all" tools toward specific tool schemas backed by dedicated Python executors. On the evaluation described under Testing, this branch achieved a 100% pass rate on the quantitative checks.

Additionally, this PR completely reorganizes the fragmented chat test suite for better maintainability, resolves several critical runtime bugs, and introduces a robust, double-blind LLM-as-a-judge evaluation methodology for future chat development.

Changes Made

🤖 Agent & Prompt Engineering Improvements

  • Specific Tool Schemas: Replaced generic tools (like search_certifications) with explicit, dedicated tools: compare_certifications, get_certification_path, get_dive_site_details, search_diving_trips, get_user_dive_logs, and get_reviews_and_comments.
  • Anti-Hallucination Guardrails: Updated the system prompt to strictly forbid the LLM from hallucinating coordinates for locations. The system now enforces reliance on the backend's deterministic Nominatim geocoding.
  • Enhanced Formatting: Instructed the LLM to automatically categorize lists of dive sites and explicitly include crucial metadata like max_depth, difficulty, and shore_direction.
  • Proactive Cross-Referencing: The LLM now proactively cross-references local Diving Centers when recommending dive sites.
  • Confident Fallbacks: Replaced rigid, apologetic error messages with confident fallback suggestions (e.g., pivoting to diving centers when dive sites aren't found).
  • Page Context Enrichment: Fixed the page context resolver to correctly map dive_site.name and inject rich physics metadata (depths, duration, serialized gas info) into the context window, enabling seamless SAC calculations on specific dive log pages.
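To illustrate why the enriched page context matters, here is a minimal sketch of the SAC arithmetic the LLM can perform once depth, duration, and gas data are present. The parameter names (`avg_depth_m`, `duration_min`, `start_bar`, `end_bar`, `tank_l`) are hypothetical and not taken from this PR. The formula is the standard surface-consumption calculation, assuming metric units and roughly 1 bar of added pressure per 10 m of seawater.

```python
def sac_rate(avg_depth_m, duration_min, start_bar, end_bar, tank_l):
    """Return (SAC in bar/min, surface consumption in L/min).

    All argument names are illustrative; the real context keys may differ.
    """
    if duration_min <= 0:
        raise ValueError("duration_min must be positive")
    # Absolute pressure at the average depth: 1 bar at the surface
    # plus about 1 bar for every 10 m of seawater.
    ambient_bar = avg_depth_m / 10.0 + 1.0
    pressure_used = start_bar - end_bar
    # Normalize the cylinder pressure drop to surface pressure.
    sac_bar_per_min = pressure_used / (duration_min * ambient_bar)
    # Multiply by cylinder volume to get litres of free gas per minute.
    litres_per_min = sac_bar_per_min * tank_l
    return sac_bar_per_min, litres_per_min
```

For a 40-minute dive at an average depth of 20 m that takes a 12 L cylinder from 200 bar to 50 bar, this gives 1.25 bar/min, or 15 L/min at the surface.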

🗺️ Dynamic Geocoding & Spatial Search

  • Dynamic Search Radius: Replaced the hardcoded 100km fallback radius with a dynamic calculation using the Haversine formula (calculate_distance in geo_utils.py). The search radius now scales proportionally to the size of the Nominatim bounding box (clamped between 5km and 200km).
  • Scope Fix: Resolved an UnboundLocalError in discovery.py by moving get_empirical_region_bounds and get_external_region_bounds imports to the module level.
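The dynamic radius can be sketched as follows. This is not the code from `geo_utils.py` or `discovery.py`: only the name `calculate_distance`, the use of the Haversine formula, and the 5 km to 200 km clamp come from this PR. Taking half the bounding-box diagonal as the radius, the helper name `dynamic_radius_km`, and its argument order are assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def calculate_distance(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points (Haversine formula)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def dynamic_radius_km(south, north, west, east, min_km=5.0, max_km=200.0):
    """Scale the search radius with the bounding box, clamped to [min_km, max_km].

    Assumption: the radius is half the box diagonal, so a circle centred
    on the box roughly covers it.
    """
    diagonal = calculate_distance(south, west, north, east)
    return max(min_km, min(max_km, diagonal / 2))
```

With this shape, a small harbour town falls back to the 5 km floor, while a country-sized box is capped at 200 km instead of growing without bound.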

🎯 Intent Extraction Refinements

  • Dedicated Executors: Shifted from a massive monolithic others.py conditional block to dedicated executor modules (user_data.py and reviews.py), respecting global privacy settings like disable_diving_center_reviews.
  • Gear Rental (SearchGearRentalTool): Created a dedicated tool schema. Updated the fallback logic to explicitly join(GearRentalCost) to only recommend centers verified to offer rentals.
  • Career Path (CAREER_PATH): Implemented regex-based tokenization and stop-word filtering to accurately extract specific certification entities.
  • Comparisons (COMPARISON): Improved the sorting heuristic to prioritize exact whole-word matches over partial substring matches. Increased the result cap from 10 to 20 to prevent real database seed data from crowding out exact matches.
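Two of the refinements above, stop-word filtering for `CAREER_PATH` and whole-word prioritization for `COMPARISON`, can be sketched like this. The stop-word list, the function names, and the tie-breaking by name are all hypothetical; only the general technique and the cap of 20 results are taken from this PR.

```python
import re

# Illustrative stop words; the real list in the executor is not shown in this PR.
STOP_WORDS = {"how", "do", "i", "get", "to", "from", "the", "a", "an",
              "and", "what", "is", "path", "become"}


def extract_entities(query):
    """Tokenize on alphanumeric runs and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def rank_matches(term, names, limit=20):
    """Return names containing `term`, whole-word matches first."""
    whole_word = re.compile(rf"\b{re.escape(term)}\b", re.IGNORECASE)
    hits = [n for n in names if term.lower() in n.lower()]
    # False sorts before True, so whole-word matches lead the list;
    # ties are broken alphabetically to keep the order deterministic.
    ranked = sorted(hits, key=lambda n: (whole_word.search(n) is None, n.lower()))
    return ranked[:limit]
```

Ranking before truncating is the important part: if the cap were applied first, a large number of partial matches from seed data could push the exact match out of the result set.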

🐛 Bug Fixes & Reliability

  • Fixed shadowing bug in discovery executor where the date argument conflicted with the date class import, crashing trip searches.
  • Resolved missing argument in weather enrichment pipeline by correctly passing intent_location to the enricher function.
  • Updated ChatIntermediateAction schema to capture tool_name and raw tool_result, enabling a high-fidelity audit trail.
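The shadowing bug in the first bullet is easy to reproduce in isolation. The functions below are hypothetical stand-ins, not the discovery executor itself, and the fix shown (renaming the parameter and importing the module under an alias) is one common remedy rather than necessarily the one applied here.

```python
import datetime as dt
from datetime import date


def search_trips_buggy(date=None):
    # The parameter named `date` hides the imported `date` class inside
    # this function, so `date.today()` is looked up on the argument.
    return date.today()


def search_trips_fixed(trip_date=None):
    # A distinct parameter name plus a module alias removes the ambiguity.
    if trip_date:
        return dt.date.fromisoformat(trip_date)
    return dt.date.today()
```

Calling `search_trips_buggy("2026-03-10")` raises `AttributeError` because a `str` has no `today` attribute, while `search_trips_fixed("2026-03-10")` returns the parsed date.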

🧪 Testing & Evaluation Methodology

  • Suite Reorganization: Merged 11 scattered and overlapping test files into 3 logically organized files (test_chat_agent.py, test_chat_executors.py, test_chat_api.py) that align with the system's modular architecture.
  • Fixture Fixes: Resolved data-dependency bugs in recommendation fixtures to ensure reliable test execution.
  • LLM-as-a-Judge Evaluation Pipeline: Introduced analyze_chat_quality_diff.py and evaluate_qualitative.py scripts to automate double-blind A/B testing of chatbot responses.
  • Documentation: Added comprehensive Markdown documentation in docs/development/chat_evaluation_methodology.md establishing the standard operating procedure for running the new evaluation pipeline.
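The blinding step of an A/B judge pipeline can be sketched as below. This is not the content of `analyze_chat_quality_diff.py` or `evaluate_qualitative.py`: the function names are hypothetical, and `judge` is a placeholder for whatever callable wraps the LLM. The idea is that the judge only ever sees responses labelled "A" and "B" in a randomized order, and the labels are mapped back to branches after the verdict.

```python
import random
from collections import Counter


def blind_pair(prompt, old, new, rng):
    """Randomly assign the two responses to labels A and B."""
    flipped = rng.random() < 0.5
    first, second = (new, old) if flipped else (old, new)
    # The key stays with the harness and is never shown to the judge.
    key = {"A": "new" if flipped else "old",
           "B": "old" if flipped else "new"}
    return {"prompt": prompt, "A": first, "B": second}, key


def run_eval(cases, judge, seed=0):
    """Tally wins per branch; `judge` must return "A" or "B"."""
    rng = random.Random(seed)  # seeded so a run can be reproduced
    tally = Counter()
    for prompt, old, new in cases:
        blinded, key = blind_pair(prompt, old, new, rng)
        verdict = judge(blinded)
        tally[key[verdict]] += 1
    return tally
```

Randomizing the label order per prompt guards against position bias in the judge, since a model that tends to favour whichever answer it reads first would otherwise skew the tally toward one branch.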

Testing

  • Automated Tests: Ran the full backend test suite (./docker-test-github-actions.sh). All tests pass (1430/1430). The newly consolidated test files correctly handle DB seed data overlaps.
  • Qualitative LLM Evaluation: Ran a comprehensive, double-blind A/B evaluation against 38 real-world prompts comparing the old architecture to this branch. This branch achieved a 100% quantitative pass rate and won the qualitative assessment 28-10, fixing previous regressions in Gear Rental, Certifications, and Regional Searches.
  • Manual Verification: Manually tested complex prompts via a local testing script to verify new tools (like get_user_dive_logs and get_reviews_and_comments) correctly invoked backend logic and respected privacy constraints.

Related Issues

  • Resolves intent parsing regressions and geocoding hallucination issues identified during the chat architecture migration.

Additional Notes

  • Reviewers: Pay special attention to the new specialized tools in tools.py and the dynamic Haversine radius calculation in discovery.py.
  • Deployment: No database migrations are required. The backend environment must have access to outbound internet for Nominatim API calls (already standard for geo_utils), and DEEPSEEK_API_KEY must be configured to run the new evaluation methodology scripts.

kargig added 4 commits March 8, 2026 20:50
Enhance the agentic chat subsystem by refining the tool execution history
metadata and resolving critical bugs identified during quality
evaluation.

Key Improvements:
- Update ChatIntermediateAction schema to capture tool_name and raw
  tool_result, enabling a high-fidelity audit trail for AI reasoning.
- Fix shadowing bug in discovery executor where the 'date' argument
  conflicted with the 'date' class import, crashing trip searches.
- Resolve missing argument in weather enrichment pipeline by correctly
  passing intent_location to the enricher function.
- Prevent fuzzy site resolution from blocking subsequent text filters by
  preserving the location parameter after coordinate resolution.
- Enforce sensible PPO2 defaults in tool schemas and system prompts to
  prevent unnecessary LLM clarification loops.

Testing & Reliability:
- Introduce test_chat_agent_integration.py to verify the wiring between
  LLM tool calls and Python backend logic.
- Introduce test_chat_agent_comprehensive.py to validate complex edge
  cases including fuzzy name resolution and physics calculations.
- Update ENTITY_ICONS imports in base executor to resolve NameErrors.

Overhaul the chat testing architecture to reduce fragmentation and improve
maintainability. Merged 11 scattered test files into 3 logically organized
primary files that align with the system's modular design.

Key Changes:
- Create test_chat_agent.py: Focused on the ReAct loop, tool calling logic,
  context resolution, and fuzzy location name mapping.
- Create test_chat_executors.py: Focused on backend capability logic
  including spatial bounding boxes, directions, ratings, and physics.
- Update test_chat_api.py: Maintained as the high-level REST endpoint and
  session management validation suite.
- Remove 9 redundant and overlapping test files to eliminate clutter.
- Fix data-dependency bugs in recommendation fixtures to ensure reliable
  test execution in isolated environments.

This reorganization provides a clear map for future test development and
ensures 70%+ coverage on critical chat service components.

- Add `calculate_distance` using Haversine formula to compute dynamic search radius based on bounding box size, replacing the hardcoded 100km fallback.
- Update system prompt in `chat_service.py` to prevent LLM coordinate hallucinations for regions/cities, enforcing reliance on Nominatim.
- Resolve `UnboundLocalError` in `discovery.py` by promoting geocoding imports to the module level.
- Introduce `SearchGearRentalTool` to handle specific gear rental intents and filter fallback diving centers strictly by `GearRentalCost` existence.
- Refine `CAREER_PATH` execution with regex tokenization and stop words to accurately extract certification entities.
- Enhance `COMPARISON` intent sorting to prioritize exact word matches, resolving overlapping mock data issues in `test_comparison_logic`.
- Increase data limits for discovery and comparison intents to provide the LLM with a denser context window.

Remove the generic `search_certifications` tool in favor of highly
specific schemas (`compare_certifications`, `get_certification_path`,
`get_dive_site_details`, and `search_diving_trips`) to eliminate LLM
confusion and regex parsing in the backend.

Add `get_user_dive_logs` and `get_reviews_and_comments` tools, routing
them to new, dedicated executor modules (`user_data.py` and
`reviews.py`). This allows the LLM to analyze personal logbooks and
community feedback while strictly respecting the global
`disable_diving_center_reviews` privacy setting.

Fix the page context resolver to correctly map `dive_site.name` and
inject rich physics metadata (depths, duration, serialized gas info)
into the context window so the LLM can seamlessly perform SAC
calculations on specific dive logs.

@kargig kargig changed the title from "Fix chat intent parsing, dynamic geocoding, and consolidate testing" to "Comprehensive Chat Agent Overhaul, Testing Consolidation, and Evaluation Methodology" Mar 9, 2026

Add `analyze_chat_quality_diff.py` and `evaluate_qualitative.py` scripts
to automate double-blind A/B testing of chatbot responses using an LLM
as a judge, ensuring quantitative and qualitative regressions are caught.

Update `evaluate_chat_quality.py` to fix typo 'Athens' -> 'Attica' in
test prompt for gear rental validation.

Add comprehensive Markdown documentation in
`docs/development/chat_evaluation_methodology.md` establishing the standard
operating procedure for running the new evaluation pipeline.

@kargig kargig force-pushed the fix/improve_chat_tests branch 2 times, most recently from b6eb317 to d16a648 March 10, 2026 07:51
@kargig kargig merged commit 7e8e4c2 into main Mar 10, 2026
1 check passed