improve language support for search #183
Conversation
📝 Walkthrough

Postgres-backed discovery and cached resolution of text-search configs; Unicode-aware query tokenization and sanitization with shared token-char helpers; providers reuse a consolidated language resolution; forms use safe language-label fallbacks; a new management command and tests; added pycountry dependency.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Client
    participant Providers
    participant LanguageUtils
    participant PostgresDB
    participant Pycountry
    Client->>Providers: search/count/retrieve(filters, term)
    Providers->>LanguageUtils: _resolve_language(filters)
    LanguageUtils->>LanguageUtils: normalize code & build candidates
    LanguageUtils->>PostgresDB: fetch pg_ts_config (cached)
    alt DB returns configs
        PostgresDB-->>LanguageUtils: configs set
    else DB error / empty
        PostgresDB-->>LanguageUtils: error/empty
        LanguageUtils->>LanguageUtils: log warning, use fallback "simple"
    end
    LanguageUtils->>Pycountry: lookup name variants (if needed)
    Pycountry-->>LanguageUtils: name candidates
    LanguageUtils-->>Providers: resolved_search_config
    Providers->>Providers: sanitize term using is_search_token_char
    Providers-->>Client: execute query with resolved config and sanitized term
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Summary of Changes

Hello @NumericalAdvantage, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances the application's internationalization by improving language support within search. It ensures that search queries accurately process a wider range of characters from various languages, broadening accessibility and improving the experience for non-English speakers. It also makes language selection more robust against unexpected or undefined language codes.

Highlights
Code Review
This pull request significantly improves language support for the search functionality. The query parser is now Unicode-aware, allowing non-ASCII characters in search terms, which is a great enhancement for international users. The changes to make language dropdowns and language code handling more resilient by using defaults instead of raising errors are excellent for improving the robustness of the application. A new test case for Turkish characters has been added, which is good practice. The implementation is clean and the logic for handling different character types and sanitizing input is correct and secure. The changes are well-executed.
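To make the described behavior concrete, here is a minimal Unicode-aware sanitizer in the spirit of these changes — `SAFE_TERM_CHARS`, `is_token_char`, and `sanitize_term` are illustrative names, not the PR's actual code:

```python
import unicodedata

SAFE_TERM_CHARS = {"-", "'"}  # assumed extra-safe punctuation; the real set may differ

def is_token_char(char: str) -> bool:
    # Accept explicitly safe chars plus Unicode Letters, Numbers, and Marks.
    if char in SAFE_TERM_CHARS:
        return True
    return unicodedata.category(char)[0] in ("L", "N", "M")

def sanitize_term(term: str) -> str:
    # Replace disallowed chars with spaces, then collapse whitespace runs.
    cleaned = "".join(c if is_token_char(c) else " " for c in term)
    return " ".join(cleaned.split())

# Turkish characters survive sanitization instead of being stripped.
print(sanitize_term("çocuk; oyuncağı!"))  # çocuk oyuncağı
```

This kind of category-based check is what lets non-ASCII scripts pass through while punctuation that could confuse the query builder is dropped.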
Pull Request Review: Improve Language Support for Search

Summary

This PR enhances the query parser to properly handle non-ASCII Unicode characters and makes language dropdown forms resilient to unknown language codes. The changes are well-targeted and address real internationalization issues.

Strengths
Code Quality Observations

Positive
Minor Concerns

1. Unicode Category Selection (Low Priority)

The check `return category[0] in ("L", "N", "M")` admits the Unicode Mark category, which includes combining characters, diacritics, and enclosing marks. While this is likely correct for search terms, it's worth verifying this doesn't allow unexpected combining characters that could cause issues with PostgreSQL full-text search.

Recommendation: This is probably fine, but consider adding a test case with combining diacritics to ensure proper handling.

2. Language Code Normalization

In:

```python
def code_to_language(code: str) -> str:
    if not code:
        return "simple"
    normalized = code.lower()
    base = normalized.split("-", 1)[0].split("_", 1)[0]
    return LANGUAGES.get(base, "simple")
```

This handles both hyphen-separated (en-US) and underscore-separated (en_US) codes.
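A combining-diacritics test of the kind suggested could look like the following sketch (standard library only; `in_token_categories` is a hypothetical stand-in for the parser's category check):

```python
import unicodedata

def in_token_categories(char: str) -> bool:
    # Mirrors the check above: Letter, Number, or Mark categories.
    return unicodedata.category(char)[0] in ("L", "N", "M")

# NFD decomposition turns "é" into "e" followed by U+0301 (combining acute).
decomposed = unicodedata.normalize("NFD", "café")
assert len(decomposed) == 5
assert unicodedata.category(decomposed[-1]) == "Mn"  # nonspacing Mark
assert all(in_token_categories(c) for c in decomposed)
```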
Recommendation: Add debug logging when falling back to "simple" for unknown language codes.

3. Missing Test Coverage for New Functions

The new helper functions lack dedicated unit tests. Recommendation: Add tests that exercise them directly.

4. No Tests for Language Utilities

The changes to the language utilities are not covered by tests. Recommendation: Create a dedicated test module for them.
Security

No security concerns identified.
Performance

No performance concerns.
Testing Recommendations
Final Recommendation

APPROVE with suggestions for follow-up improvements. The PR successfully achieves its goals.
The suggestions above are enhancements rather than blocking issues.
These can be addressed in this PR or in a follow-up, depending on your timeline.

Nitpick

In:

```python
language = code_to_language(search.filters.language)
query = SearchQuery(query_str, search_type="raw", config=language)
filter_query = _build_filter_query(search.filters)
language = code_to_language(search.filters.language)  # Duplicate
```

The second assignment on line 91 is redundant (though harmless). Consider removing it for clarity. Great work on improving internationalization support!
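The normalization behavior can be exercised standalone — the `LANGUAGES` mapping below is a stand-in for the module's real table, not its actual contents:

```python
# Stand-in mapping of base ISO codes to Postgres text-search configs.
LANGUAGES = {"en": "english", "tr": "turkish", "de": "german"}

def code_to_language(code: str) -> str:
    if not code:
        return "simple"
    normalized = code.lower()
    base = normalized.split("-", 1)[0].split("_", 1)[0]
    return LANGUAGES.get(base, "simple")

assert code_to_language("en-US") == "english"   # hyphenated region code
assert code_to_language("tr_TR") == "turkish"   # underscore variant
assert code_to_language("") == "simple"         # empty falls back
assert code_to_language("xx") == "simple"       # unknown falls back
```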
Code Review for PR #183: Improve Language Support for Search

I've reviewed the changes to improve Unicode support in search queries and enhance language handling. Overall, this is a well-implemented feature with good test coverage. Here are my findings:

✅ Strengths
🔍 Potential Issues

1. Cache Invalidation Strategy (radis/pgsearch/utils/language_utils.py:12)

The `@lru_cache` persists for the lifetime of the process. Consideration: Is this acceptable? The documentation says to restart RADIS, but consider:
2. Performance: Repeated Database Queries (radis/pgsearch/providers.py:88-140)

Every search call executes the language resolution logic. Suggestion: Consider caching:

```python
@lru_cache(maxsize=128)
def code_to_language(code: str) -> str:
    # ... existing implementation
```

This would cache the mapping for commonly used language codes.

3. Unicode Normalization Form (radis/search/utils/query_parser.py:61-65)

The code checks Unicode categories but doesn't normalize the query string itself. Consider if users paste text with different Unicode forms (NFC vs NFD). Example: "café" could be precomposed (é as U+00E9) or decomposed (e followed by U+0301).

Recommendation: Add normalization in `parse`:

```python
def parse(self, query: str) -> tuple[QueryNode | None, list[str]]:
    query = unicodedata.normalize('NFC', query)  # Normalize early
    # ... rest of method
```

4. Test Gap: Special Unicode Categories (radis/search/tests/test_query_parser.py:149-163)

Tests cover basic cases but miss some edge cases:
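The concern is easy to reproduce with the standard library — the two encodings of "café" compare unequal until normalized:

```python
import unicodedata

precomposed = "caf\u00e9"   # é as one code point (NFC)
decomposed = "cafe\u0301"   # e + combining acute accent (NFD)

# Visually identical, but different code-point sequences.
assert precomposed != decomposed
assert len(precomposed) == 4 and len(decomposed) == 5

# Normalizing to NFC makes them compare (and tokenize) consistently.
assert unicodedata.normalize("NFC", decomposed) == precomposed
```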
If these should be excluded, the tests correctly validate the behavior. If they should be allowed, the character handling needs adjusting.

5. Error Recovery Silent Failure (radis/pgsearch/utils/language_utils.py:18-20)

Database errors return an empty set, falling back to the 'simple' config. This means:
Suggestion: Log at warning level instead:

```python
except DatabaseError as exc:
    logger.warning("Failed to read pg_ts_config; falling back to simple. %s", exc)
    return set()
```

6. Duplicate Variable Assignment Removed (radis/pgsearch/providers.py:88-91)

Good catch removing the duplicate assignment.

🔐 Security Considerations
🚀 Performance Considerations
📝 Style & Best Practices
🎯 Recommendations Summary

High Priority:
Medium Priority:

Low Priority:

✨ Conclusion

This is a solid PR that meaningfully improves international language support. The code is well-tested, follows project conventions, and handles edge cases thoughtfully. The recommendations above are primarily optimizations rather than blockers. Great work on the Unicode-aware implementation and comprehensive test coverage! 🎉
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@radis/pgsearch/utils/language_utils.py`:
- Around line 12-20: The current `@lru_cache` on get_available_search_configs
causes a transient DatabaseError to cache an empty set; replace the decorator
with a manual module-level cache so only successful reads are stored: add a
module variable (e.g., _cached_search_configs: set[str] | None = None), have
get_available_search_configs return the cache if set, otherwise query inside the
try, assign the lowercased result to _cached_search_configs on success and
return it, and in the except block log and return an empty set without setting
the cache; reference get_available_search_configs and the except/try block when
making the change.
```python
@lru_cache(maxsize=1)
def get_available_search_configs() -> set[str]:
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT cfgname FROM pg_ts_config")
            return {row[0].lower() for row in cursor.fetchall()}
    except DatabaseError as exc:
        logger.debug("Failed to read pg_ts_config; falling back to simple. %s", exc)
        return set()
```
Don’t cache failed pg_ts_config reads.
With @lru_cache, a transient DatabaseError will cache an empty set, forcing a permanent fallback to simple until restart. Cache only successful reads.
🧯 Suggested fix

```diff
-@lru_cache(maxsize=1)
-def get_available_search_configs() -> set[str]:
-    try:
-        with connection.cursor() as cursor:
-            cursor.execute("SELECT cfgname FROM pg_ts_config")
-            return {row[0].lower() for row in cursor.fetchall()}
-    except DatabaseError as exc:
-        logger.debug("Failed to read pg_ts_config; falling back to simple. %s", exc)
-        return set()
+@lru_cache(maxsize=1)
+def _get_available_search_configs_cached() -> set[str]:
+    with connection.cursor() as cursor:
+        cursor.execute("SELECT cfgname FROM pg_ts_config")
+        return {row[0].lower() for row in cursor.fetchall()}
+
+
+def get_available_search_configs() -> set[str]:
+    try:
+        return _get_available_search_configs_cached()
+    except DatabaseError as exc:
+        logger.debug("Failed to read pg_ts_config; falling back to simple. %s", exc)
+        _get_available_search_configs_cached.cache_clear()
+        return set()
```
…age configurations being installed.
Pull Request Review: Improve Language Support for Search

This PR enhances RADIS search with Unicode-aware query parsing and dynamic language detection from PostgreSQL.

Strengths
Critical Issues

1. ReDoS Vulnerability (HIGH PRIORITY)

Location: radis/pgsearch/utils/language_utils.py:33

The regex pattern `[^A-Za-z0-9]+` can cause catastrophic backtracking with alternating valid/invalid characters. Recommendation: Replace with a non-backtracking approach by removing the `+` quantifier or using string iteration.

2. Language Config Validation (MEDIUM PRIORITY)

Location: radis/pgsearch/providers.py:89,104,132

While code_to_language() validates against pg_ts_config, add explicit format validation to ensure the language only contains safe characters.

3. Cache Invalidation Missing (MEDIUM PRIORITY)

The @lru_cache never refreshes when PostgreSQL configs change. Consider adding a management command to clear the cache.

Other Recommendations

Should Fix:
Nice to Have:
Overall Assessment

Grade: B+ (Good with reservations)

Solid internationalization improvement with good test coverage. The dynamic language detection is elegant and the Unicode handling is robust. However, the ReDoS vulnerability must be fixed before merging. Once the security issue is addressed and basic integration testing is added, this will be ready to merge. Great work!
PR Review: Improve Language Support for Search

Thank you for this comprehensive improvement to language support! This is a well-structured PR that addresses several important issues. Here's my detailed feedback:

✅ Strengths
🔍 Issues & Concerns

1. Cache Invalidation Issue (Critical)

```python
def clear_search_config_cache() -> None:
    get_available_search_configs.cache_clear()
```

This function calls `cache_clear()` on the public wrapper, but the `@lru_cache` is applied to the inner function. It should be:

```python
def clear_search_config_cache() -> None:
    _get_available_search_configs_cached.cache_clear()
```

The actual cached function is `_get_available_search_configs_cached`.

Location: radis/pgsearch/utils/language_utils.py

2. SQL Injection Risk in pg_ts_config Query (High Priority)

```python
cursor.execute("SELECT cfgname FROM pg_ts_config")
```

While the current query is safe, it's vulnerable if this code is ever refactored to accept parameters. Consider using parameterized queries or adding a security comment explaining why this is safe.

Location: radis/pgsearch/utils/language_utils.py

3. Duplicate Language Resolution Logic

The language resolution logic is repeated rather than shared; consider consolidating it into a single helper.

4. Missing Documentation on pycountry Dependency

The new dependency on pycountry is not documented. Consider adding docstrings to the helpers that use it.

5. Query Parser Regex Too Permissive

```python
word = ~(not_ | and_ | or_) + pp.Regex(r"[^\s()]+").set_parse_action(
```

The regex accepts any run of non-whitespace, non-parenthesis characters. A tighter alternative:

```python
word = ~(not_ | and_ | or_) + pp.Regex(r"[\w\-'\p{L}\p{M}]+", re.UNICODE).set_parse_action(
```

Though pyparsing's Unicode support may be limited, so the current approach might be pragmatic.

🎯 Suggestions for Improvement
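One caveat on the suggested pattern: the stdlib `re` module has no `\p{L}` escape (that class only works with the third-party `regex` package), but `\w` is already Unicode-aware in Python 3, so a plain character class covers non-ASCII scripts — a small illustrative check:

```python
import re

# \w matches Unicode letters and digits by default in Python 3.
token = re.compile(r"[\w'\-]+")

assert token.findall("çocuk oyuncağı") == ["çocuk", "oyuncağı"]
assert token.findall("don't-stop now") == ["don't-stop", "now"]
```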
📋 Code Quality Assessment
🔒 Security Review
⚡ Performance Considerations
🧪 Testing

The test coverage is excellent, but consider adding:
📝 Documentation
🎉 Verdict

This is a solid improvement that significantly enhances language support. The critical cache clearing bug needs to be fixed before merging, but otherwise this is well-architected and thoroughly tested code.

Recommendation: Request changes for the cache invalidation bug, then approve after the fix.

Critical Issues: 1
…ation and add tests for non ASCII characters
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@radis/pgsearch/utils/language_utils.py`:
- Around line 68-69: The clear_search_config_cache function calls
get_available_search_configs.cache_clear(), but the lru_cache is applied to
_get_available_search_configs_cached, so calling cache_clear on the wrong symbol
raises AttributeError; update clear_search_config_cache to call
_get_available_search_configs_cached.cache_clear() (or alternatively move the
`@lru_cache` decorator to get_available_search_configs) so the cache_clear call
targets the actual cached function (_get_available_search_configs_cached).
Pull Request Review: Improve Language Support for Search

Summary

This PR makes significant improvements to search functionality by adding Unicode-aware query parsing and dynamic PostgreSQL text search config resolution. The changes address important internationalization needs for non-ASCII languages.

Positive Aspects ✅

1. Excellent Test Coverage
2. Security-Conscious Implementation
3. Robust Fallback Mechanism
4. Unicode-Aware Character Handling
Issues & Recommendations 🔍

1. Cache Invalidation Concern
Code Review: Improve Language Support for Search

This PR makes meaningful improvements to Unicode handling and language support in the search system. The changes are well-structured with good test coverage. Here are my findings:

Strengths

1. Unicode-Aware Query Parsing ✅

The shift from pyparsing's limited character sets to Unicode category checking is excellent:

```python
def is_search_token_char(char: str) -> bool:
    if char in SAFE_TERM_CHARS:
        return True
    category = unicodedata.category(char)
    return category[0] in ('L', 'N', 'M')
```

This properly handles Turkish characters (ş, ğ, ı), combining diacritics, and other non-ASCII scripts.

2. Dynamic Language Configuration ✅

The runtime resolution of PostgreSQL text search configs is a significant improvement over hardcoded mappings:
3. Security Hardening ✅

Good SQL injection prevention.

4. Excellent Test Coverage ✅

The test suite covers edge cases well:
5. Graceful Degradation ✅

Forms now use safe fallbacks for unknown language codes.

Issues & Recommendations

1. Cache Invalidation Risk
Pull Request Review: Improve Language Support for Search

This PR makes important improvements to language handling in the search functionality. Overall, the implementation is well-thought-out with good test coverage. Below are my detailed findings:

✅ Strengths
🔍 Issues & Concerns

Critical Issues
Medium Priority Issues
Minor Issues
📝 Code Quality Observations
🎯 Recommendations
✨ Overall Assessment

This is a solid improvement to the codebase. The Unicode support is properly implemented, security is taken seriously, and the test coverage is comprehensive. The main concerns are:
Recommendation: ✅ Approve with minor suggestions

The code is production-ready. The suggestions above are enhancements rather than blockers. Great work on the test coverage and graceful error handling!

Note: I couldn't run the actual tests or check CI status, so please verify all tests pass and linting is clean before merging.
Make query parsing Unicode‑aware so non‑ASCII letters aren’t stripped.
Make language dropdowns resilient to unknown language codes.
Add a test case covering Turkish characters in the query parser.