
improve language support for search#183

Merged
NumericalAdvantage merged 8 commits into main from textSearchLangSupport on Jan 27, 2026

Conversation

Collaborator

@NumericalAdvantage NumericalAdvantage commented Jan 19, 2026

Make query parsing Unicode‑aware so non‑ASCII letters aren’t stripped.
Make language dropdowns resilient to unknown language codes.
Add a test case covering Turkish characters in the query parser.

Summary by CodeRabbit

  • New Features

    • Unicode-aware search tokenization, improved handling of unbalanced quotes/parentheses, runtime resolution of search language configs, and an admin command to refresh cached search configs
  • Bug Fixes

    • Safer query sanitization and graceful fallback to language codes when labels are missing; more tolerant language-code matching
  • Documentation

    • Added maintenance notes on search language configs
  • Tests

    • Expanded tests for token helpers and language resolution
  • Chores

    • Added dependency to support enhanced language handling



coderabbitai Bot commented Jan 19, 2026

📝 Walkthrough

Postgres-backed discovery and cached resolution of text-search configs; Unicode-aware query tokenization and sanitization with shared token-char helpers; providers reuse a consolidated language resolution; forms use safe language-label fallbacks; new management command and tests; added pycountry dependency.

Changes

  • Language label fallbacks (radis/extractions/forms.py, radis/search/forms.py, radis/subscriptions/forms.py):
    Replace LANGUAGE_LABELS[language.code] with LANGUAGE_LABELS.get(language.code, language.code) to avoid KeyError and fall back to the language code as the label.
  • Runtime language resolution & caching (radis/pgsearch/utils/language_utils.py, radis/pgsearch/tests/test_language_utils.py, radis/pgsearch/management/commands/refresh_search_configs.py):
    Replace the static LANGUAGES mapping with DB-backed discovery of Postgres text-search configs; add normalization and candidate generation (pycountry), caching and a clear-cache helper, and DB error handling with fallback to "simple", plus tests and a management command to clear the cache.
  • Query parser Unicode & robustness (radis/search/utils/query_parser.py, radis/search/tests/test_query_parser.py):
    Add SAFE_TERM_CHARS, is_search_token_char, and is_search_query_char; broaden the token-char rules and token regex; handle unbalanced quotes and parentheses; update tests for diacritics and composed characters.
  • Search provider sanitization & language reuse (radis/pgsearch/providers.py, radis/pgsearch/tests/__init__.py):
    Remove the local pyparsing-based whitelist, import and use is_search_token_char, and add _resolve_language(filters) to compute and reuse the resolved language across search/count/retrieve call sites.
  • Reports view changes (radis/reports/api/viewsets.py):
    Replace .id usages with .pk for existing related objects and newly created through rows; some multiline reformatting.
  • Docs & dependency (docs/Maintenance.md, pyproject.toml):
    Document Postgres text-search config behavior and the reindex requirement; add the pycountry>=24.6.1 dependency.
  • Management package housekeeping (radis/pgsearch/management/__init__.py, radis/pgsearch/management/commands/__init__.py):
    Add module docstrings for the management package and commands.
  • Tests / view update (radis/search/tests/test_views.py):
    Update the search view test to use a diacritic-containing query token ("krüşk").

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Providers
    participant LanguageUtils
    participant PostgresDB
    participant Pycountry

    Client->>Providers: search/count/retrieve(filters, term)
    Providers->>LanguageUtils: _resolve_language(filters)
    LanguageUtils->>LanguageUtils: normalize code & build candidates
    LanguageUtils->>PostgresDB: fetch pg_ts_config (cached)
    alt DB returns configs
        PostgresDB-->>LanguageUtils: configs set
    else DB error / empty
        PostgresDB-->>LanguageUtils: error/empty
        LanguageUtils->>LanguageUtils: log warning, use fallback "simple"
    end
    LanguageUtils->>Pycountry: lookup name variants (if needed)
    Pycountry-->>LanguageUtils: name candidates
    LanguageUtils-->>Providers: resolved_search_config
    Providers->>Providers: sanitize term using is_search_token_char
    Providers-->>Client: execute query with resolved config and sanitized term

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble scripts in Unicode light,

Labels fall back when fonts take flight,
Accents kept safe in query and stream,
DB whispers configs — a searcher’s dream,
Hooray — hopping results in sight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 3.70%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title "improve language support for search" accurately reflects the main objective of the PR, which includes Unicode-aware query parsing, resilient language dropdowns, and Turkish character support.



@gemini-code-assist

Summary of Changes

Hello @NumericalAdvantage, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the application's internationalization capabilities by improving language support within the search functionality. It ensures that search queries can accurately process a wider range of characters from various languages, thereby broadening accessibility and improving the user experience for non-English speakers. Additionally, it fortifies the system's resilience by making language selection mechanisms more robust against unexpected or undefined language codes.

Highlights

  • Unicode-Aware Query Parsing: The search query parser has been updated to be fully Unicode-aware, ensuring that non-ASCII characters (such as 'Hämatom' or 'krüşk') are correctly recognized and processed rather than being stripped. This is achieved by leveraging Python's unicodedata module and more flexible regex patterns for term recognition.
  • Resilient Language Dropdowns: Language selection dropdowns across various forms (extraction, search, subscriptions) are now more robust. They gracefully handle unknown or unsupported language codes by displaying the raw language code as a fallback, preventing errors that would occur if the code was not found in predefined labels.
  • Improved Language Code Handling Utility: The code_to_language utility function has been refactored to normalize language codes (e.g., converting 'en-US' to 'en') and provide a 'simple' fallback for any unsupported or empty language codes, enhancing the overall stability of language-related operations.
  • New Test Case for Unicode Characters: A new test case has been added to validate the query parser's ability to correctly handle Turkish characters, specifically 'krüşk', confirming the successful implementation of Unicode support.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request significantly improves language support for the search functionality. The query parser is now Unicode-aware, allowing non-ASCII characters in search terms, which is a great enhancement for international users. The changes to make language dropdowns and language code handling more resilient by using defaults instead of raising errors are excellent for improving the robustness of the application. A new test case for Turkish characters has been added, which is good practice. The implementation is clean and the logic for handling different character types and sanitizing input is correct and secure. The changes are well-executed.


claude Bot commented Jan 19, 2026

Pull Request Review: Improve Language Support for Search

Summary

This PR enhances the query parser to properly handle non-ASCII Unicode characters and makes language dropdown forms resilient to unknown language codes. The changes are well-targeted and address real internationalization issues.


Strengths

  1. Unicode-Aware Character Validation: The new is_search_token_char() and is_search_query_char() functions use unicodedata.category() to properly classify characters, which correctly handles Turkish, German, and other non-ASCII letters.

  2. Consistent Application: The refactoring ensures the same character validation logic is used in both the query parser and the PostgreSQL search provider (sanitize_term()), eliminating duplication.

  3. Graceful Degradation: The .get() pattern with fallback values in forms prevents KeyErrors when encountering unknown language codes.

  4. Test Coverage: Adding a Turkish test case validates the Unicode handling works for complex non-Latin scripts.
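The .get() fallback pattern mentioned in point 3 can be sketched as follows; this is a minimal illustration, and the LANGUAGE_LABELS contents here are invented for the example (the real mapping lives in the RADIS codebase):

```python
# Hypothetical labels mapping, stand-in for the real LANGUAGE_LABELS.
LANGUAGE_LABELS = {"en": "English", "de": "German"}

def label_for(code: str) -> str:
    # Unknown codes fall back to the raw code instead of raising KeyError.
    return LANGUAGE_LABELS.get(code, code)

print(label_for("de"))  # German
print(label_for("tr"))  # tr (no label defined, the code itself is used)
```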


Code Quality Observations

Positive

  • Clean separation of concerns with dedicated character validation functions
  • Proper use of Unicode categories (L=Letter, N=Number, M=Mark)
  • Simplified regex pattern makes the parser more maintainable
  • Consistent code style following Google Python Style Guide

Minor Concerns

1. Unicode Category Selection (Low Priority)

The is_search_token_char() includes category "M" (Mark):

return category[0] in ("L", "N", "M")

Unicode Mark category includes combining characters, diacritics, and enclosing marks. While this is likely correct for search terms, it's worth verifying this doesn't allow unexpected combining characters that could cause issues with PostgreSQL full-text search.

Recommendation: This is probably fine, but consider adding a test case with combining diacritics to ensure proper handling.
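To make the recommendation concrete, here is a small sketch of the category check (mirroring the rule quoted above) probed with a combining diacritic:

```python
import unicodedata

def is_search_token_char(char: str) -> bool:
    # Category initials: L = Letter, N = Number, M = Mark (combining chars).
    return unicodedata.category(char)[0] in ("L", "N", "M")

print(is_search_token_char("ş"))       # True (Ll, Turkish letter)
print(is_search_token_char("\u0301"))  # True (Mn, COMBINING ACUTE ACCENT)
print(is_search_token_char("&"))       # False (Po, punctuation)
```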


2. Language Code Normalization

In radis/pgsearch/utils/language_utils.py:

def code_to_language(code: str) -> str:
    if not code:
        return "simple"
    normalized = code.lower()
    base = normalized.split("-", 1)[0].split("_", 1)[0]
    return LANGUAGES.get(base, "simple")

This handles both en-US and en_US formats, which is good. However:

  • The function always returns "simple" for unknown languages, which may mask configuration issues
  • Consider logging a warning when falling back to "simple" for debugging purposes

Recommendation: Add debug logging when falling back to "simple" for unknown language codes.


3. Missing Test Coverage for New Functions

The new helper functions is_search_token_char() and is_search_query_char() are not directly tested. While they're exercised through the query parser tests, dedicated unit tests would improve confidence and serve as documentation.

Recommendation: Add tests in radis/search/tests/test_query_parser.py for these utility functions.


4. No Tests for Language Utilities

The changes to code_to_language() lack test coverage. This function is critical for search functionality.

Recommendation: Create radis/pgsearch/tests/test_language_utils.py with tests covering:

  • Known languages (en, de)
  • Case insensitivity
  • Locale variants (en-US, en_GB)
  • Unknown language fallback
  • Empty string handling
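Such a test module could start out like the sketch below; the LANGUAGES mapping here is a small stand-in mirroring the code_to_language() snippet quoted above, since the real table lives in language_utils:

```python
# Stand-in mirroring the code_to_language() snippet quoted above; the real
# LANGUAGES mapping is larger and lives in radis/pgsearch/utils/language_utils.py.
LANGUAGES = {"en": "english", "de": "german"}

def code_to_language(code: str) -> str:
    if not code:
        return "simple"
    normalized = code.lower()
    base = normalized.split("-", 1)[0].split("_", 1)[0]
    return LANGUAGES.get(base, "simple")

assert code_to_language("en") == "english"     # known language
assert code_to_language("DE") == "german"      # case insensitivity
assert code_to_language("en-US") == "english"  # locale variant
assert code_to_language("en_GB") == "english"  # underscore variant
assert code_to_language("xx") == "simple"      # unknown language fallback
assert code_to_language("") == "simple"        # empty string handling
```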

Security

No security concerns identified. The changes:

  • Don't introduce SQL injection risks (character filtering happens before database queries)
  • Don't expose sensitive information
  • Don't change authentication/authorization logic

Performance

No performance concerns. The changes:

  • Replace pyparsing character sets with lightweight unicodedata lookups (likely faster)
  • Maintain O(n) complexity for character validation
  • Don't add database queries or expensive operations

Testing Recommendations

  1. Run existing tests to ensure no regressions:

    uv run cli test -- radis/search/tests/test_query_parser.py
    uv run cli test -- radis/pgsearch/
  2. Manual testing with various Unicode inputs:

    • German: Hämatom Magen-Darm-Trakt
    • Turkish: krüşk characters
    • Mixed: CT Hämatom patient_123
  3. Test language dropdown behavior with an unsupported language code in the database.


Final Recommendation

APPROVE with suggestions for follow-up improvements.

The PR successfully achieves its goals:

  • Makes query parsing Unicode-aware
  • Makes language dropdowns resilient to unknown codes
  • Includes test coverage for Turkish characters

The suggestions above are enhancements rather than blocking issues:

  • Add logging for unknown language codes (helps debugging)
  • Add unit tests for the new helper functions (improves maintainability)
  • Add tests for code_to_language() (prevents regressions)

These can be addressed in this PR or in a follow-up, depending on your timeline.


Nitpick

In radis/pgsearch/providers.py:88-91, code_to_language() is called twice with the same argument:

language = code_to_language(search.filters.language)
query = SearchQuery(query_str, search_type="raw", config=language)
filter_query = _build_filter_query(search.filters)
language = code_to_language(search.filters.language)  # Duplicate

The second assignment on line 91 is redundant (though harmless). Consider removing it for clarity.


Great work on improving internationalization support!


claude Bot commented Jan 20, 2026

Code Review for PR #183: Improve Language Support for Search

I've reviewed the changes to improve Unicode support in search queries and enhance language handling. Overall, this is a well-implemented feature with good test coverage. Here are my findings:

✅ Strengths

  1. Excellent Unicode Support: The query parser now properly handles non-ASCII characters using unicodedata.category() checks. This is the correct approach for international character support.

  2. Comprehensive Test Coverage:

    • New tests in test_language_utils.py cover edge cases (Turkish, diacritics, locale variants, three-letter codes)
    • Tests for Unicode characters in test_query_parser.py (Turkish krüşk, combining characters)
    • Database error handling tested with mock
  3. Smart Language Matching: The code_to_language() function is robust with:

    • Case-insensitive matching
    • Locale variant handling (en-US, en_GB)
    • Fallback to pycountry for comprehensive language database
    • Graceful degradation to 'simple' config
  4. Good Documentation: The Maintenance.md addition explains the search config caching behavior, which is helpful for operators.

  5. Defensive Coding: Forms now use .get(language.code, language.code) to handle missing language labels gracefully.

🔍 Potential Issues

1. Cache Invalidation Strategy (radis/pgsearch/utils/language_utils.py:12)

The @lru_cache(maxsize=1) on get_available_search_configs() is persistent for the process lifetime. If PostgreSQL configs change dynamically (admin installs new dictionaries), the cache won't update until restart.

Consideration: Is this acceptable? The documentation says to restart RADIS, but consider:

  • Adding a management command to clear the cache
  • Time-based cache expiry (using ttl_cache from cachetools)
  • Or document this limitation more prominently in security/operations docs

2. Performance: Repeated Database Queries (radis/pgsearch/providers.py:88-140)

Every search call executes code_to_language(search.filters.language) which calls the cached get_available_search_configs(). While cached, the pycountry lookup and string processing happens on every search.

Suggestion: Consider caching code_to_language() results with @lru_cache(maxsize=128):

@lru_cache(maxsize=128)
def code_to_language(code: str) -> str:
    # ... existing implementation

This would cache the mapping for commonly used language codes.
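The memoization effect is easy to demonstrate with a toy resolver; the function below is a stand-in for illustration, not the real code_to_language():

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def resolve(code: str) -> str:
    # Stand-in for the expensive part (pycountry lookup, string processing).
    return code.lower().split("-", 1)[0]

resolve("en-US")
resolve("en-US")            # second call is served from the cache
print(resolve.cache_info()) # reports one hit, one miss
```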

3. Unicode Normalization Form (radis/search/utils/query_parser.py:61-65)

The code checks Unicode categories but doesn't normalize the query string itself. Consider if users paste text with different Unicode forms (NFC vs NFD).

Example: "café" could be:

  • NFC: caf\u00e9 (4 code points, with a composed "é")
  • NFD: cafe\u0301 (5 code points, base "e" plus a combining accent)

Recommendation: Add normalization in _replace_invalid_characters or at parse entry:

def parse(self, query: str) -> tuple[QueryNode | None, list[str]]:
    query = unicodedata.normalize('NFC', query)  # Normalize early
    # ... rest of method
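The two forms are easy to tell apart programmatically, which shows why normalizing early matters:

```python
import unicodedata

nfd = "cafe\u0301"                        # "café" as base letter + combining accent
nfc = unicodedata.normalize("NFC", nfd)   # composed form "café"

print(len(nfd), len(nfc))   # 5 4
print(nfd == nfc)           # False, despite identical rendering
```

Normalizing both sides to the same form before comparison or indexing makes the strings compare equal.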

4. Test Gap: Special Unicode Categories (radis/search/tests/test_query_parser.py:149-163)

Tests cover basic cases but miss some edge cases:

  • Emoji (category 'So') - are these intentionally excluded?
  • Currency symbols (category 'Sc')
  • Mathematical symbols

If these should be excluded, the tests correctly validate behavior. If they should be allowed, is_search_token_char needs adjustment.

5. Error Recovery Silent Failure (radis/pgsearch/utils/language_utils.py:18-20)

Database errors return empty set, falling back to 'simple' config. This means:

  • Legitimate language codes get silently downgraded
  • No monitoring/alerting of configuration issues

Suggestion: Log at warning level instead of debug, or raise an exception during startup checks:

except DatabaseError as exc:
    logger.warning("Failed to read pg_ts_config; falling back to simple. %s", exc)
    return set()

6. Duplicate Variable Assignment Removed (radis/pgsearch/providers.py:88-91)

Good catch removing duplicate language = code_to_language(...) assignments! Clean refactoring.

🔐 Security Considerations

  1. SQL Injection: The direct SQL in get_available_search_configs() queries pg_ts_config (system catalog). This is safe as it's not parameterized with user input.

  2. Input Sanitization: The sanitize_term() function properly filters characters using is_search_token_char(). This prevents injection of PostgreSQL tsquery operators.
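The character-level filtering idea can be sketched as below; this is a hypothetical reconstruction, not the exact sanitize_term() from the PR, and the whitespace handling is an assumption:

```python
import unicodedata

def is_search_token_char(char: str) -> bool:
    # Rule quoted earlier in the review: letters, numbers, marks allowed.
    return unicodedata.category(char)[0] in ("L", "N", "M")

def sanitize_term(term: str) -> str:
    # Replace disallowed characters (incl. tsquery operators like & | !)
    # with spaces, then collapse runs of whitespace.
    cleaned = "".join(c if is_search_token_char(c) else " " for c in term)
    return " ".join(cleaned.split())

print(sanitize_term("Hämatom & (CT)"))  # Hämatom CT
```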

🚀 Performance Considerations

  1. Caching works well: The LRU cache on get_available_search_configs() prevents repeated DB queries.

  2. Unicode category checks: unicodedata.category() is O(1) and efficient.

  3. Minor optimization opportunity: The seen set in code_to_language() (line 66-70) is a good optimization to avoid duplicate lookups.
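The seen-set idiom referenced in point 3 looks roughly like this (the names here are illustrative, not taken from the PR):

```python
def dedup_preserving_order(candidates: list[str]) -> list[str]:
    # "seen" avoids duplicate lookups while keeping first-seen order.
    seen: set[str] = set()
    unique = []
    for cand in candidates:
        if cand not in seen:
            seen.add(cand)
            unique.append(cand)
    return unique

print(dedup_preserving_order(["english", "en", "english", "eng"]))
# ['english', 'en', 'eng']
```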

📝 Style & Best Practices

  1. Follows Google Python Style Guide: Line length ≤100 chars, type hints present.

  2. Good separation of concerns: Character validation moved to reusable helpers.

  3. Test quality: Uses pytest fixtures (monkeypatch) appropriately, tests are isolated.

🎯 Recommendations Summary

High Priority:

  1. Consider caching code_to_language() results for performance
  2. Add Unicode normalization (NFC) at query parse entry

Medium Priority:
3. Upgrade database error log level from debug to warning
4. Clarify emoji/special symbol handling in tests

Low Priority:
5. Add cache invalidation mechanism or management command for config updates
6. Document the cache behavior in operations runbook

✨ Conclusion

This is a solid PR that meaningfully improves international language support. The code is well-tested, follows project conventions, and handles edge cases thoughtfully. The recommendations above are primarily optimizations rather than blockers.

Great work on the Unicode-aware implementation and comprehensive test coverage! 🎉


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@radis/pgsearch/utils/language_utils.py`:
- Around line 12-20: The current `@lru_cache` on get_available_search_configs
causes a transient DatabaseError to cache an empty set; replace the decorator
with a manual module-level cache so only successful reads are stored: add a
module variable (e.g., _cached_search_configs: set[str] | None = None), have
get_available_search_configs return the cache if set, otherwise query inside the
try, assign the lowercased result to _cached_search_configs on success and
return it, and in the except block log and return an empty set without setting
the cache; reference get_available_search_configs and the except/try block when
making the change.

Comment on lines +12 to +20

@lru_cache(maxsize=1)
def get_available_search_configs() -> set[str]:
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT cfgname FROM pg_ts_config")
            return {row[0].lower() for row in cursor.fetchall()}
    except DatabaseError as exc:
        logger.debug("Failed to read pg_ts_config; falling back to simple. %s", exc)
        return set()

⚠️ Potential issue | 🟠 Major

Don’t cache failed pg_ts_config reads.

With @lru_cache, a transient DatabaseError will cache an empty set, forcing a permanent fallback to simple until restart. Cache only successful reads.

🧯 Suggested fix
-@lru_cache(maxsize=1)
-def get_available_search_configs() -> set[str]:
-    try:
-        with connection.cursor() as cursor:
-            cursor.execute("SELECT cfgname FROM pg_ts_config")
-            return {row[0].lower() for row in cursor.fetchall()}
-    except DatabaseError as exc:
-        logger.debug("Failed to read pg_ts_config; falling back to simple. %s", exc)
-        return set()
+@lru_cache(maxsize=1)
+def _get_available_search_configs_cached() -> set[str]:
+    with connection.cursor() as cursor:
+        cursor.execute("SELECT cfgname FROM pg_ts_config")
+        return {row[0].lower() for row in cursor.fetchall()}
+
+
+def get_available_search_configs() -> set[str]:
+    try:
+        return _get_available_search_configs_cached()
+    except DatabaseError as exc:
+        logger.debug("Failed to read pg_ts_config; falling back to simple. %s", exc)
+        _get_available_search_configs_cached.cache_clear()
+        return set()


claude Bot commented Jan 22, 2026

Pull Request Review: Improve Language Support for Search

This PR enhances RADIS search with Unicode-aware query parsing and dynamic language detection from PostgreSQL.

Strengths

  1. Unicode Character Support - Properly handles combining characters and uses unicodedata.category() for letter/number/mark detection
  2. Dynamic Language Detection - Queries pg_ts_config at runtime with LRU caching and pycountry for ISO 639 resolution
  3. Graceful Error Handling - Falls back to simple config with proper logging
  4. Code Cleanup - Removed duplicate code_to_language() calls in providers.py

Critical Issues

1. ReDoS Vulnerability (HIGH PRIORITY)

Location: radis/pgsearch/utils/language_utils.py:33

The regex pattern [^A-Za-z0-9]+ can cause catastrophic backtracking with alternating valid/invalid characters.

Recommendation: Replace with non-backtracking approach by removing the + quantifier or using string iteration.

2. Language Config Validation (MEDIUM PRIORITY)

Location: radis/pgsearch/providers.py:89,104,132

While code_to_language() validates against pg_ts_config, add explicit format validation to ensure language only contains safe characters.

3. Cache Invalidation Missing (MEDIUM PRIORITY)

The @lru_cache never refreshes when PostgreSQL configs change. Consider adding a management command to clear the cache.

Other Recommendations

Should Fix:

  • Add integration test with actual non-ASCII search queries
  • Add explicit input validation regex for language codes

Nice to Have:

  • Add PostgreSQL text search config installation examples to docs
  • Add test cases for language names with parentheses
  • Improve test naming (test_code_to_language_continents)

Overall Assessment

Grade: B+ (Good with reservations)

Solid internationalization improvement with good test coverage. The dynamic language detection is elegant and Unicode handling is robust. However, the ReDoS vulnerability must be fixed before merging.

Once the security issue is addressed and basic integration testing is added, this will be ready to merge.

Great work!


claude Bot commented Jan 22, 2026

PR Review: Improve Language Support for Search

Thank you for this comprehensive improvement to language support! This is a well-structured PR that addresses several important issues. Here's my detailed feedback:

Strengths

  1. Unicode-aware search: Excellent improvement making query parsing Unicode-aware by using unicodedata.category() instead of pyparsing's limited character sets. This properly handles Turkish characters like 'ş' and combining diacritics.

  2. Dynamic language resolution: Smart approach querying pg_ts_config at runtime instead of hardcoding a language map. This makes the system flexible and adaptable to different PostgreSQL installations.

  3. Comprehensive test coverage: The new test suite for language_utils.py is thorough, covering edge cases like:

    • Case sensitivity
    • Locale variants (en-US, en_GB)
    • Three-letter ISO codes
    • SQL injection attempts
    • Database errors
    • Multi-word language names
  4. Graceful error handling: Proper fallback to 'simple' config when languages aren't found, with appropriate logging.

  5. Security considerations: Good SQL injection protection with _is_safe_language_code() validation.

  6. Resilient form dropdowns: Using .get(language.code, language.code) prevents KeyError when unknown language codes exist in the database.

🔍 Issues & Concerns

1. Cache Invalidation Issue (Critical)

def clear_search_config_cache() -> None:
    get_available_search_configs.cache_clear()

This function calls .cache_clear() on the wrong function! It should call:

def clear_search_config_cache() -> None:
    _get_available_search_configs_cached.cache_clear()

The actual cached function is _get_available_search_configs_cached, not get_available_search_configs. This means the management command won't actually clear the cache.

Location: radis/pgsearch/utils/language_utils.py:68-69

2. SQL Injection Risk in pg_ts_config Query (High Priority)

cursor.execute("SELECT cfgname FROM pg_ts_config")

While the current query is safe, it's vulnerable if this code is ever refactored to accept parameters. Consider using parameterized queries or adding a security comment explaining why this is safe.

Location: radis/pgsearch/utils/language_utils.py:14

3. Duplicate Language Resolution Logic

In providers.py, lines 88, 131, and 140 all call code_to_language(). The duplicate assignment on line 88 was correctly removed, but the pattern repeats in all three functions (search, count, retrieve). Consider extracting this to a shared helper or documenting why it can't be cached per-request.

Location: radis/pgsearch/providers.py:88, 131, 140

4. Missing Documentation on pycountry Dependency

The new dependency on pycountry significantly changes how language resolution works, but there's no inline documentation explaining:

  • Why pycountry was chosen over alternatives
  • What happens when pycountry data is outdated
  • Performance implications of the lookup

Consider adding docstrings to _language_name_candidates() explaining this design decision.

Location: radis/pgsearch/utils/language_utils.py:42-61

5. Query Parser Regex Too Permissive

word = ~(not_ | and_ | or_) + pp.Regex(r"[^\s()]+").set_parse_action(

The regex [^\s()]+ matches ANY non-whitespace, non-parenthesis character, but then is_search_query_char() filters them out later. This creates unnecessary work. Consider making the regex match the actual allowed characters:

word = ~(not_ | and_ | or_) + pp.Regex(r"[\w\-'\p{L}\p{M}]+", re.UNICODE).set_parse_action(

Though pyparsing's Unicode support may be limited, so the current approach might be pragmatic.

Location: radis/search/utils/query_parser.py:196

🎯 Suggestions for Improvement

  1. Add performance metrics: The @lru_cache(maxsize=1) on _get_available_search_configs_cached() is appropriate, but consider logging cache hits/misses for monitoring.

  2. Test the management command: Add an integration test verifying refresh_search_configs actually clears the cache (this would catch issue 1 above).

  3. Document Unicode normalization: The NFKD normalization in _normalize_language_name() is subtle. Add a comment explaining why NFKD is chosen over NFC/NFD/NFKC.

  4. Consider lazy loading: The current implementation queries pg_ts_config on first search. For high-traffic systems, consider warming this cache during app startup.

  5. Add migration notes: The PR description mentions reindexing reports after installing new configs, but this should also be documented in the code or migrations.
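As an illustration of why NFKD is a common choice for name matching (the actual _normalize_language_name() may differ), NFKD decomposes accented letters into a base letter plus combining marks, which can then be stripped:

```python
import unicodedata

def strip_marks(name: str) -> str:
    # NFKD splits "ü" into "u" + U+0308; combining marks are then dropped.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_marks("Türkçe"))    # Turkce
print(strip_marks("français"))  # francais
```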

📋 Code Quality Assessment

  • Style: ✅ Follows Google Python Style Guide and project conventions
  • Line Length: ✅ Adheres to 100-character limit
  • Type Hints: ✅ Proper type annotations throughout
  • Error Handling: ✅ Appropriate try/except with logging
  • Test Coverage: ✅ Comprehensive tests for new functionality

🔒 Security Review

  • ✅ SQL injection protection via _is_safe_language_code()
  • ✅ Input validation on language codes
  • ✅ Proper error handling prevents information leakage
  • ⚠️ pg_ts_config query could use explicit security documentation

Performance Considerations

  • ✅ LRU cache minimizes database queries
  • ✅ Early return on empty/invalid codes
  • ✅ Efficient set operations for config lookup
  • 💡 Consider profiling pycountry.languages.lookup() under load

🧪 Testing

The test coverage is excellent, but consider adding:

  1. Integration test for the management command
  2. Performance test for code_to_language() with high concurrency
  3. Test for combining marks in search queries (like your café example)

📝 Documentation

The new docs/Maintenance.md section is helpful but could be expanded:

  • How to verify which configs are available
  • Example of installing new PostgreSQL text search dictionaries
  • Impact on search performance for different languages

🎉 Verdict

This is a solid improvement that significantly enhances language support. The critical cache clearing bug needs to be fixed before merging, but otherwise this is well-architected and thoroughly tested code.

Recommendation: Request changes for the cache invalidation bug, then approve after fix.


Critical Issues: 1
High Priority Issues: 1
Suggestions: 5
Lines Changed: +319/-26

…ation and add tests for non ASCII characters

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@radis/pgsearch/utils/language_utils.py`:
- Around line 68-69: The clear_search_config_cache function calls
get_available_search_configs.cache_clear(), but the lru_cache is applied to
_get_available_search_configs_cached, so calling cache_clear on the wrong symbol
raises AttributeError; update clear_search_config_cache to call
_get_available_search_configs_cached.cache_clear() (or alternatively move the
`@lru_cache` decorator to get_available_search_configs) so the cache_clear call
targets the actual cached function (_get_available_search_configs_cached).

Comment thread radis/pgsearch/utils/language_utils.py Outdated
@claude

claude Bot commented Jan 23, 2026

Pull Request Review: Improve Language Support for Search

Summary

This PR makes significant improvements to search functionality by adding Unicode-aware query parsing and dynamic PostgreSQL text search config resolution. The changes address important internationalization needs for non-ASCII languages.

Positive Aspects ✅

1. Excellent Test Coverage

  • Comprehensive tests for language resolution (141 lines in test_language_utils.py)
  • Tests cover edge cases: SQL injection attempts (en;DROP), diacritics, multi-word names, 3-letter codes
  • Good coverage for Unicode token helpers in query parser tests
  • Database error handling is well-tested with monkeypatching

2. Security-Conscious Implementation

  • _is_safe_language_code() validates input before use in queries (line 68-69 in language_utils.py)
  • Static SQL query with comment explaining no user input (line 14-15 in language_utils.py)
  • Proper sanitization prevents SQL injection attacks

3. Robust Fallback Mechanism

  • Graceful degradation to simple config when language not found
  • Database errors handled without crashing (returns empty set)
  • Missing language labels fall back to language code in forms

4. Unicode-Aware Character Handling

  • Uses unicodedata.category() for proper Unicode support (line 64-65 in query_parser.py)
  • Handles combining marks (Mark category) correctly
  • NFKD normalization for consistent diacritic handling (line 33 in language_utils.py)

Issues & Recommendations 🔍

1. Cache Invalidation Concern ⚠️

Location: radis/pgsearch/utils/language_utils.py:11-16

The @lru_cache(maxsize=1) decorator caches PostgreSQL configs for the lifetime of the process. While this is good for performance, there's a potential issue:

@lru_cache(maxsize=1)
def _get_available_search_configs_cached() -> set[str]:
    with connection.cursor() as cursor:
        cursor.execute("SELECT cfgname FROM pg_ts_config")
        return {row[0].lower() for row in cursor.fetchall()}

Problem: If new PostgreSQL text search configs are installed at runtime (which the maintenance docs suggest is possible), the cache won't be refreshed until the application restarts.

Recommendation: The refresh_search_configs management command exists but may not be discoverable. Consider:

  • Adding a comment near the cache explaining when/why to clear it
  • Documenting the command in CLAUDE.md or making it more visible
  • Consider TTL-based caching (e.g., with functools.lru_cache + a timestamp check) for long-running processes

Documentation Update: The Maintenance.md note is good, but consider adding it to troubleshooting sections too.

2. Duplicate Language Resolution 🔄

Location: radis/pgsearch/providers.py:89-96, 133-140, 145-152

The _resolve_language() helper was added to reduce duplication, which is excellent. However, I notice it's called three times in very similar patterns:

language = _resolve_language(search.filters)
query = SearchQuery(query_str, search_type="raw", config=language)
filter_query = _build_filter_query(search.filters)

This is clean! The only minor suggestion: consider extracting the common pattern of building both query and filter to reduce the three nearly-identical function bodies.

3. Missing Type Hint 📝

Location: radis/search/utils/query_parser.py:180

def _delete_spaces_at_start_of_parens(self, input_string):  # Missing return type
    return self._modify_unquoted_segments(input_string, lambda s: re.sub(r"\(\s*", "(", s))

Should be: def _delete_spaces_at_start_of_parens(self, input_string: str) -> str:

4. Performance Consideration 💡

Location: radis/pgsearch/utils/language_utils.py:45-65

The _language_name_candidates() function calls pycountry.languages.get() and potentially pycountry.languages.lookup() on every language code resolution. For frequently-used language codes (en, de, etc.), this could be cached at the module level.

Suggestion:

@lru_cache(maxsize=128)  # Cache language code -> candidates mapping
def _language_name_candidates(code: str) -> list[str]:
    # ... existing implementation

This would benefit search-heavy workloads where the same language codes are resolved repeatedly.

5. Regex Pattern Broadness 🎯

Location: radis/search/utils/query_parser.py:197

word = ~(not_ | and_ | or_) + pp.Regex(r"[^\s()]+").set_parse_action(
    lambda t: TermNode("WORD", t[0])  # type: ignore
)

The comment says "Input is pre-sanitized; keep regex broad to avoid pyparsing Unicode limitations." This is reasonable, but the regex [^\s()]+ will match ANY non-whitespace, non-paren character, including potentially invalid ones that slipped through sanitization.

Recommendation: Add a defensive assertion or comment explaining why this is safe (because _replace_invalid_characters runs first in the parse() method).

6. Test Isolation 🧪

Location: radis/pgsearch/tests/test_language_utils.py:111, 137

Tests manually call language_utils._get_available_search_configs_cached.cache_clear(). This works but could be fragile if test order matters.

Recommendation: Consider a pytest fixture with autouse to clear the cache before each test:

@pytest.fixture(autouse=True)
def clear_config_cache():
    language_utils._get_available_search_configs_cached.cache_clear()
    yield

7. Minor: Empty init Method 🧹

Location: radis/search/utils/query_parser.py:75-76

def __init__(self):
    pass

This can be removed entirely since it does nothing.

Best Practices Followed 🌟

  1. Defensive programming with SQL injection prevention
  2. Comprehensive logging at appropriate levels (warning for unknown languages, debug for invalid codes)
  3. Backward compatibility with fallback to simple config
  4. Clear separation of concerns (language resolution, normalization, validation)
  5. Good test structure with descriptive test names
  6. Documentation updates alongside code changes

Security Assessment 🔒

No security vulnerabilities found. The PR properly:

  • Validates language codes before use
  • Uses static SQL queries
  • Sanitizes user input in query parser
  • Handles edge cases (empty strings, special chars)

The test test_code_to_language_invalid_chars specifically validates SQL injection prevention.

Performance Impact 📊

Positive:

  • Caching of PostgreSQL configs reduces DB queries
  • Language resolution is efficient with early returns

Neutral:

  • pycountry lookups add minimal overhead (could be cached as noted above)
  • Unicode normalization has acceptable performance for typical query lengths

Suggested Improvements (Optional)

  1. Add @lru_cache to _language_name_candidates() for repeated lookups
  2. Consider adding a health check endpoint that validates search config availability
  3. Add integration tests that verify end-to-end search with Turkish/non-ASCII queries
  4. Document the search language resolution algorithm in CLAUDE.md

Conclusion

This is a high-quality PR that significantly improves internationalization support. The implementation is secure, well-tested, and follows Django/Python best practices. The issues identified are minor and mostly relate to potential optimizations or documentation improvements.

Recommendation: Approve with minor suggestions for cache optimization and documentation enhancements.

Great work on the comprehensive test coverage and security-conscious implementation! 🎉

@claude

claude Bot commented Jan 26, 2026

Code Review: Improve Language Support for Search

This PR makes meaningful improvements to Unicode handling and language support in the search system. The changes are well-structured with good test coverage. Here are my findings:

Strengths

1. Unicode-Aware Query Parsing

The shift from pyparsing's limited character sets to Unicode category checking is excellent:

def is_search_token_char(char: str) -> bool:
    if char in SAFE_TERM_CHARS:
        return True
    category = unicodedata.category(char)
    return category[0] in ('L', 'N', 'M')

This properly handles Turkish characters (ş, ğ, ı), combining diacritics, and other non-ASCII scripts.
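
The category test can be verified directly; a quick standalone illustration of why the 'L'/'N'/'M' prefixes admit exactly these characters:

```python
import unicodedata

# General-category prefixes: L* = letters, N* = numbers, M* = combining marks
assert all(unicodedata.category(c)[0] == "L" for c in "şğıé")
assert unicodedata.category("3")[0] == "N"
assert unicodedata.category("\u0308")[0] == "M"  # combining diaeresis
assert unicodedata.category("$")[0] == "S"       # symbols fall outside and are rejected
```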

2. Dynamic Language Configuration

The runtime resolution of PostgreSQL text search configs is a significant improvement over hardcoded mappings:

  • Uses pycountry for ISO code → language name mapping
  • Queries pg_ts_config to discover available configs
  • Proper fallback to 'simple' when language not found

3. Security Hardening

Good SQL injection prevention in _is_safe_language_code() and safe database query in line 15 of language_utils.py.

4. Excellent Test Coverage

The test suite covers edge cases well:

  • Turkish characters (test_query_parser.py:40)
  • Combining diacritics (test_query_parser.py:41)
  • Invalid language codes with SQL injection attempts (test_language_utils.py:61-63)
  • Database errors (test_language_utils.py:103-141)

5. Graceful Degradation

Forms now use .get() with fallback to language code when labels are missing (radis/search/forms.py:63, etc.)
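
The fallback pattern is simply dict.get with the code itself as the default; the LANGUAGE_LABELS contents below are illustrative:

```python
LANGUAGE_LABELS = {"en": "English", "de": "German"}


def label_for(code: str) -> str:
    # unknown codes render as the raw code instead of raising KeyError
    return LANGUAGE_LABELS.get(code, code)


print(label_for("de"))  # German
print(label_for("tr"))  # tr
```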


Issues & Recommendations

1. Cache Invalidation Risk ⚠️ MEDIUM

Issue: The LRU cache for search configs (@lru_cache(maxsize=1)) persists across the application lifetime. If PostgreSQL configs change (e.g., new language dictionaries installed), the cache won't refresh until restart.

Location: radis/pgsearch/utils/language_utils.py:11-16

Current mitigation: Documentation mentions restart requirement (docs/Maintenance.md), and there's a management command.

Recommendation: Consider adding automatic cache expiration:

import logging
import time
from functools import lru_cache

from django.db import DatabaseError, connection

logger = logging.getLogger(__name__)

_cache_timestamp = None
CACHE_TTL = 3600  # 1 hour

@lru_cache(maxsize=1)
def _get_available_search_configs_cached(cache_key: int) -> set[str]:
    # cache_key is unused in the body; it exists only so a new TTL window
    # produces a cache miss even if cache_clear() was skipped
    with connection.cursor() as cursor:
        cursor.execute("SELECT cfgname FROM pg_ts_config")
        return {row[0].lower() for row in cursor.fetchall()}

def get_available_search_configs() -> set[str]:
    global _cache_timestamp
    now = time.time()
    if _cache_timestamp is None or (now - _cache_timestamp) > CACHE_TTL:
        _cache_timestamp = now
        _get_available_search_configs_cached.cache_clear()

    try:
        return _get_available_search_configs_cached(int(now / CACHE_TTL))
    except DatabaseError as exc:
        logger.warning("Failed to read pg_ts_config; falling back to simple. %s", exc)
        return set()

Alternative: Document that the management command should be run after config changes.


2. Performance: Redundant Database Queries ⚠️ MEDIUM

Issue: code_to_language() is called for every search operation but queries the database on first call. On high-traffic systems, the initial query could cause a spike.

Location: radis/pgsearch/providers.py:28-29, called from lines 92, 135, 144

Recommendation: Consider warming the cache at application startup (Django AppConfig.ready() method):

# radis/pgsearch/apps.py
from django.apps import AppConfig

class PgsearchConfig(AppConfig):
    name = 'radis.pgsearch'
    
    def ready(self):
        # Warm up the language config cache
        from .utils.language_utils import get_available_search_configs
        try:
            get_available_search_configs()
        except Exception:
            pass  # Will retry on first actual use

3. Code Duplication 📝 MINOR

Issue: The pattern of resolving language and building query appears multiple times in providers.py.

Location: radis/pgsearch/providers.py:90-94, 133-138, 142-146

Current state: Already extracted _resolve_language() helper (line 28-29) ✅

Suggestion: Consider extracting the common pattern further:

def _create_search_query(filters: SearchFilters, query_str: str) -> SearchQuery:
    language = _resolve_language(filters)
    return SearchQuery(query_str, search_type="raw", config=language)

This is a minor improvement and not critical.


4. Regex Simplification in Query Parser 📝 MINOR

Issue: The comment on line 196-197 mentions "avoid pyparsing Unicode limitations" but the regex is now r"[^\s()]+" which is very permissive.

Location: radis/search/utils/query_parser.py:197

Concern: This regex will match ANY non-whitespace, non-paren character, including potentially problematic ones that were sanitized earlier. This relies on the sanitization step happening first.

Recommendation: Add a test to ensure sanitization happens before pyparsing:

def test_sanitization_before_parsing():
    # Ensure invalid chars are removed before pyparsing sees them
    parser = QueryParser()
    node, fixes = parser.parse("test$invalid@chars")
    assert "Fixed invalid characters" in fixes
    assert node is not None

Status: This is already tested implicitly in test_query_parser.py:131-133 ✅


5. Missing Edge Case Tests 📝 MINOR

Missing test cases:

  1. What happens with empty string after normalization in _normalize_language_name()?
    • Already handled (lines 30-31, 37-38) ✅
  2. What if pg_ts_config returns duplicate config names with different cases?
    • Handled by .lower() in line 16 ✅
  3. What if a language code is exactly "simple"?
    • Works correctly as it's checked against configs

Recommendation: Add test for "simple" language code:

def test_code_to_language_simple_passthrough(monkeypatch):
    set_configs(monkeypatch, {"english", "simple"})
    assert code_to_language("simple") == "simple"

6. Documentation Clarity 📝 MINOR

Issue: The maintenance documentation could be more explicit about when to run the refresh command.

Location: docs/Maintenance.md:21-25

Current text:

"restart RADIS to refresh the config cache"

Recommendation: Clarify the alternative:

## Search language configs

RADIS reads available text search configs from Postgres (`pg_ts_config`) and auto-maps
language codes to matching configs (falling back to `simple`). If new dictionaries/configs
are installed in Postgres, you have two options:
1. Restart the RADIS application to refresh the config cache
2. Run `uv run cli manage refresh_search_configs` to clear the cache without restart

After refreshing the cache, reindex reports to apply the new config to existing data.

7. Type Safety 📝 MINOR

Issue: The code_to_language() function always returns a string, but callers might not know that empty input returns "simple".

Location: radis/pgsearch/utils/language_utils.py:76-97

Recommendation: Add a docstring:

def code_to_language(code: str) -> str:
    """
    Map a language code to a PostgreSQL text search configuration name.
    
    Args:
        code: ISO 639-1/639-3 language code (e.g., 'en', 'de', 'tr')
    
    Returns:
        PostgreSQL config name (e.g., 'english', 'german', 'turkish').
        Falls back to 'simple' if the code is empty, invalid, or unsupported.
    """

Security Assessment ✅

SQL Injection: Well protected

  • Line 15: Static SQL query with no user input ✅
  • Line 68-69: Input validation prevents injection in language codes ✅
  • Test coverage for injection attempts (test_language_utils.py:61-63) ✅

ReDoS (Regular Expression Denial of Service): Not a concern

  • The regex patterns are simple and bounded ✅

Performance Considerations

  1. Database Query Caching: ✅ LRU cache prevents repeated queries
  2. Unicode Normalization: NFKD normalization is fast for typical language names
  3. pycountry Lookups: Library uses internal caching, should be fast

Overall Assessment

This is a well-executed PR that meaningfully improves internationalization support. The code quality is high, with good separation of concerns, comprehensive tests, and defensive programming.

Approval Recommendation: ✅ APPROVE with minor suggestions

The issues identified are minor and don't block merging. Consider addressing the cache invalidation strategy and adding startup cache warming as follow-up improvements.

Priority of Recommendations:

  1. HIGH: Add cache warming at startup (prevents first-request latency)
  2. MEDIUM: Consider TTL-based cache expiration (operational improvement)
  3. LOW: Add docstrings and minor test cases (code quality)

Great work on improving Unicode support! The Turkish character handling and dynamic language resolution are significant improvements for international deployments.

@claude

claude Bot commented Jan 27, 2026

Pull Request Review: Improve Language Support for Search

This PR makes important improvements to language handling in the search functionality. Overall, the implementation is well-thought-out with good test coverage. Below are my detailed findings:

✅ Strengths

  1. Comprehensive Test Coverage: Excellent test suite in test_language_utils.py covering edge cases including:

    • Case-insensitive matching
    • Locale variants (en-US, en_GB)
    • Three-letter ISO codes
    • Database error handling
    • SQL injection prevention (en;DROP)
  2. Security Improvements:

    • Input validation in _is_safe_language_code() prevents SQL injection
    • The SQL query in _get_available_search_configs_cached() is properly static with no user input (line 15)
  3. Unicode Support:

    • Proper Unicode normalization using NFKD decomposition (line 33 in language_utils.py)
    • Unicode-aware character classification using unicodedata.category() (lines 64-65 in query_parser.py)
    • Handles combining characters correctly (Mark category 'M')
  4. Graceful Degradation: Falls back to 'simple' config when:

    • Language code is empty or invalid
    • Database query fails
    • No matching config found
  5. Caching Strategy: LRU cache with maxsize=1 is appropriate since configs rarely change, and the clear_search_config_cache() function allows manual refresh.

  6. Documentation: New maintenance documentation clearly explains the need to restart RADIS after installing new Postgres configs.

🔍 Issues & Concerns

Critical Issues

  1. Inconsistent Attribute Access in Bulk Upsert (radis/reports/api/viewsets.py):

    • Line 100: existing.language = language (sets relationship)
    • Line 154: .pk used consistently for modality
    • Line 168: .pk used consistently for group
    • Observation: The PR standardizes on .pk where .id and .pk were previously mixed. While the two are functionally equivalent for most Django models, .pk is more idiomatic because it also works with custom primary key fields, so this change is good.
    • Note: Setting existing.language = language (line 100) instead of existing.language_id = language.id is actually better practice as it properly handles the relationship object.
  2. Regex Pattern in Query Parser (radis/search/utils/query_parser.py:197):

    • Changed from pp.Word(pp.alphanums + pp.alphas8bit + "_-'") to pp.Regex(r"[^\s()]+")
    • Concern: The new pattern is extremely broad - it matches ANY non-whitespace, non-parenthesis character
    • Risk: Could allow special characters that might cause issues downstream if not properly sanitized
    • Mitigation: Input is pre-sanitized by _replace_invalid_characters() which uses is_search_query_char(), so this should be safe
    • Question: Why not use pp.Regex(r"[\w\-']+") or similar to be more explicit? The comment says "keep regex broad to avoid pyparsing Unicode limitations" - this could use more explanation.

Medium Priority Issues

  1. Cache Invalidation Strategy:

    • Cache is only cleared via manual command (refresh_search_configs)
    • If new Postgres text search configs are installed, the app won't detect them until restart or manual cache clear
    • Suggestion: Consider adding a cache TTL or detecting config changes at startup
    • Note: The maintenance docs address this, but it could still cause confusion
  2. Error Handling in code_to_language():

    • Line 84: If get_available_search_configs() returns empty set due to DB error, all language codes will fall back to 'simple'
    • This is logged but silent to the user
    • Consideration: In a production environment with temporary DB issues, all searches would fall back to 'simple' config, potentially degrading search quality
  3. Performance Consideration:

    • _language_name_candidates() (lines 45-65) does string processing and pycountry lookups on every call
    • Called within the search path via _resolve_language()
    • Suggestion: Consider caching language code to config mappings, not just available configs
    • Impact: Probably negligible for typical use, but could matter with high query volume

Minor Issues

  1. Type Safety in sanitize_term() (radis/pgsearch/providers.py:24-25):

    • Function doesn't validate input is a string
    • If term is None or non-string, the iteration will fail
    • Suggestion: Add type validation or use type hints more defensively
  2. Duplicate Code:

    • _resolve_language() is just a one-line wrapper around code_to_language()
    • Used 3 times in providers.py (lines 92, 135, 145)
    • Minor: Could inline this, but the wrapper provides good semantic clarity
  3. Test Coverage Gap:

    • No integration tests verifying that Turkish characters (like "krüşk") actually work end-to-end with PostgreSQL
    • The unit test in test_query_parser.py is good, but doesn't verify Postgres handles it correctly

📝 Code Quality Observations

  1. Line Length: Changes respect the 100-character limit (per CLAUDE.md)

  2. Naming Conventions: Follow Google Python Style Guide well

  3. Comments:

    • Line 33 comment about NFKD is helpful
    • Line 14 comment about static query is good for security audit
    • Line 196 comment could be more detailed about pyparsing limitations
  4. Error Messages: Clear and actionable (e.g., lines 92-95 in language_utils.py)

🎯 Recommendations

  1. Consider Adding:

    • Integration test that actually queries Postgres with Unicode characters
    • Benchmark for _language_name_candidates() performance with high query volume
    • Cache for language code mappings, not just available configs
  2. Documentation:

    • Expand comment on line 196 about pyparsing Unicode limitations
    • Add docstring to is_search_token_char() and is_search_query_char() explaining the character categories
  3. Future Enhancement:

    • Consider automatic cache invalidation on Postgres config changes (via triggers or periodic checks)
    • Add metrics/monitoring for language fallback rate

✨ Overall Assessment

This is a solid improvement to the codebase. The Unicode support is properly implemented, security is taken seriously, and the test coverage is comprehensive. The main concerns are:

  1. The very broad regex pattern (low risk due to pre-sanitization)
  2. Potential performance impact of repeated pycountry lookups (minor)
  3. Cache invalidation UX (documented but manual)

Recommendation: ✅ Approve with minor suggestions

The code is production-ready. The suggestions above are enhancements rather than blockers. Great work on the test coverage and graceful error handling!


Note: I couldn't run the actual tests or check CI status, so please verify all tests pass and linting is clean before merging.

@NumericalAdvantage NumericalAdvantage merged commit ca22984 into main Jan 27, 2026
3 checks passed
@NumericalAdvantage NumericalAdvantage deleted the textSearchLangSupport branch January 27, 2026 14:21