
improve language support for search#183

Merged
NumericalAdvantage merged 8 commits into main from textSearchLangSupport on Jan 27, 2026

Conversation

Collaborator

@NumericalAdvantage NumericalAdvantage commented Jan 19, 2026

Make query parsing Unicode‑aware so non‑ASCII letters aren’t stripped.
Make language dropdowns resilient to unknown language codes.
Add a test case covering Turkish characters in the query parser.

Summary by CodeRabbit

  • New Features

    • Unicode-aware search tokenization, improved handling of unbalanced quotes/parentheses, runtime resolution of search language configs, and an admin command to refresh cached search configs
  • Bug Fixes

    • Safer query sanitization and graceful fallback to language codes when labels are missing; more tolerant language-code matching
  • Documentation

    • Added maintenance notes on search language configs
  • Tests

    • Expanded tests for token helpers and language resolution
  • Chores

    • Added dependency to support enhanced language handling



coderabbitai Bot commented Jan 19, 2026

📝 Walkthrough

Postgres-backed discovery and cached resolution of text-search configs; Unicode-aware query tokenization and sanitization with shared token-char helpers; providers reuse a consolidated language resolution; forms use safe language-label fallbacks; new management command and tests; added pycountry dependency.

Changes

  • Language label fallbacks (radis/extractions/forms.py, radis/search/forms.py, radis/subscriptions/forms.py):
    Replace LANGUAGE_LABELS[language.code] with LANGUAGE_LABELS.get(language.code, language.code) to avoid KeyError and fall back to the language code as the label.
  • Runtime language resolution & caching (radis/pgsearch/utils/language_utils.py, radis/pgsearch/tests/test_language_utils.py, radis/pgsearch/management/commands/refresh_search_configs.py):
    Replace the static LANGUAGES mapping with DB-backed discovery of Postgres text-search configs; add normalization and candidate generation (pycountry), caching and a clear-cache helper, and DB error handling with fallback to "simple", plus tests and a management command to clear the cache.
  • Query parser Unicode & robustness (radis/search/utils/query_parser.py, radis/search/tests/test_query_parser.py):
    Add SAFE_TERM_CHARS, is_search_token_char, and is_search_query_char; broaden the token-char rules and token regex; handle unbalanced quotes and parentheses; update tests for diacritics and composed characters.
  • Search provider sanitization & language reuse (radis/pgsearch/providers.py, radis/pgsearch/tests/__init__.py):
    Remove the local pyparsing-based whitelist, import and use is_search_token_char, and add _resolve_language(filters) to compute and reuse the resolved language across search/count/retrieve call sites.
  • Reports view changes (radis/reports/api/viewsets.py):
    Replace .id usages with .pk for existing related objects and newly created through rows; some multiline reformatting.
  • Docs & dependency (docs/Maintenance.md, pyproject.toml):
    Document Postgres text-search config behavior and the reindex requirement; add the pycountry>=24.6.1 dependency.
  • Management package housekeeping (radis/pgsearch/management/__init__.py, radis/pgsearch/management/commands/__init__.py):
    Add module docstrings for the management package and commands.
  • Tests / view update (radis/search/tests/test_views.py):
    Update the search view test to use a diacritic-containing query token ("krüşk").

Sequence Diagram(s)

sequenceDiagram
    participant Client
    participant Providers
    participant LanguageUtils
    participant PostgresDB
    participant Pycountry

    Client->>Providers: search/count/retrieve(filters, term)
    Providers->>LanguageUtils: _resolve_language(filters)
    LanguageUtils->>LanguageUtils: normalize code & build candidates
    LanguageUtils->>PostgresDB: fetch pg_ts_config (cached)
    alt DB returns configs
        PostgresDB-->>LanguageUtils: configs set
    else DB error / empty
        PostgresDB-->>LanguageUtils: error/empty
        LanguageUtils->>LanguageUtils: log warning, use fallback "simple"
    end
    LanguageUtils->>Pycountry: lookup name variants (if needed)
    Pycountry-->>LanguageUtils: name candidates
    LanguageUtils-->>Providers: resolved_search_config
    Providers->>Providers: sanitize term using is_search_token_char
    Providers-->>Client: execute query with resolved config and sanitized term

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I nibble scripts in Unicode light,

Labels fall back when fonts take flight,
Accents kept safe in query and stream,
DB whispers configs — a searcher’s dream,
Hooray — hopping results in sight!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage: ⚠️ Warning. Docstring coverage is 3.70%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.

✅ Passed checks (2 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title Check: ✅ Passed. The title "improve language support for search" accurately reflects the main objective of the PR, which includes Unicode-aware query parsing, resilient language dropdowns, and Turkish character support.



@gemini-code-assist

Summary of Changes

Hello @NumericalAdvantage, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the application's internationalization capabilities by improving language support within the search functionality. It ensures that search queries can accurately process a wider range of characters from various languages, thereby broadening accessibility and improving the user experience for non-English speakers. Additionally, it fortifies the system's resilience by making language selection mechanisms more robust against unexpected or undefined language codes.

Highlights

  • Unicode-Aware Query Parsing: The search query parser has been updated to be fully Unicode-aware, ensuring that non-ASCII characters (such as 'Hämatom' or 'krüşk') are correctly recognized and processed rather than being stripped. This is achieved by leveraging Python's unicodedata module and more flexible regex patterns for term recognition.
  • Resilient Language Dropdowns: Language selection dropdowns across various forms (extraction, search, subscriptions) are now more robust. They gracefully handle unknown or unsupported language codes by displaying the raw language code as a fallback, preventing errors that would occur if the code was not found in predefined labels.
  • Improved Language Code Handling Utility: The code_to_language utility function has been refactored to normalize language codes (e.g., converting 'en-US' to 'en') and provide a 'simple' fallback for any unsupported or empty language codes, enhancing the overall stability of language-related operations.
  • New Test Case for Unicode Characters: A new test case has been added to validate the query parser's ability to correctly handle Turkish characters, specifically 'krüşk', confirming the successful implementation of Unicode support.


@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request significantly improves language support for the search functionality. The query parser is now Unicode-aware, allowing non-ASCII characters in search terms, which is a great enhancement for international users. The changes to make language dropdowns and language code handling more resilient by using defaults instead of raising errors are excellent for improving the robustness of the application. A new test case for Turkish characters has been added, which is good practice. The implementation is clean and the logic for handling different character types and sanitizing input is correct and secure. The changes are well-executed.


claude Bot commented Jan 19, 2026

Pull Request Review: Improve Language Support for Search

Summary

This PR enhances the query parser to properly handle non-ASCII Unicode characters and makes language dropdown forms resilient to unknown language codes. The changes are well-targeted and address real internationalization issues.


Strengths

  1. Unicode-Aware Character Validation: The new is_search_token_char() and is_search_query_char() functions use unicodedata.category() to properly classify characters, which correctly handles Turkish, German, and other non-ASCII letters.

  2. Consistent Application: The refactoring ensures the same character validation logic is used in both the query parser and the PostgreSQL search provider (sanitize_term()), eliminating duplication.

  3. Graceful Degradation: The .get() pattern with fallback values in forms prevents KeyErrors when encountering unknown language codes.

  4. Test Coverage: Adding a Turkish test case validates the Unicode handling works for complex non-Latin scripts.
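The .get() fallback pattern mentioned in point 3 can be sketched as follows; this is a minimal illustration, and the LANGUAGE_LABELS contents here are invented for the example (the real mapping lives in the RADIS codebase):

```python
# Hypothetical labels mapping, stand-in for the real LANGUAGE_LABELS.
LANGUAGE_LABELS = {"en": "English", "de": "German"}

def label_for(code: str) -> str:
    # Unknown codes fall back to the raw code instead of raising KeyError.
    return LANGUAGE_LABELS.get(code, code)

print(label_for("de"))  # German
print(label_for("tr"))  # tr (no label defined, the code itself is used)
```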


Code Quality Observations

Positive

  • Clean separation of concerns with dedicated character validation functions
  • Proper use of Unicode categories (L=Letter, N=Number, M=Mark)
  • Simplified regex pattern makes the parser more maintainable
  • Consistent code style following Google Python Style Guide

Minor Concerns

1. Unicode Category Selection (Low Priority)

The is_search_token_char() includes category "M" (Mark):

return category[0] in ("L", "N", "M")

Unicode Mark category includes combining characters, diacritics, and enclosing marks. While this is likely correct for search terms, it's worth verifying this doesn't allow unexpected combining characters that could cause issues with PostgreSQL full-text search.

Recommendation: This is probably fine, but consider adding a test case with combining diacritics to ensure proper handling.
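To make the recommendation concrete, here is a small sketch of the category check (mirroring the rule quoted above) probed with a combining diacritic:

```python
import unicodedata

def is_search_token_char(char: str) -> bool:
    # Category initials: L = Letter, N = Number, M = Mark (combining chars).
    return unicodedata.category(char)[0] in ("L", "N", "M")

print(is_search_token_char("ş"))       # True (Ll, Turkish letter)
print(is_search_token_char("\u0301"))  # True (Mn, COMBINING ACUTE ACCENT)
print(is_search_token_char("&"))       # False (Po, punctuation)
```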


2. Language Code Normalization

In radis/pgsearch/utils/language_utils.py:

def code_to_language(code: str) -> str:
    if not code:
        return "simple"
    normalized = code.lower()
    base = normalized.split("-", 1)[0].split("_", 1)[0]
    return LANGUAGES.get(base, "simple")

This handles both en-US and en_US formats, which is good. However:

  • The function always returns "simple" for unknown languages, which may mask configuration issues
  • Consider logging a warning when falling back to "simple" for debugging purposes

Recommendation: Add debug logging when falling back to "simple" for unknown language codes.


3. Missing Test Coverage for New Functions

The new helper functions is_search_token_char() and is_search_query_char() are not directly tested. While they're exercised through the query parser tests, dedicated unit tests would improve confidence and serve as documentation.

Recommendation: Add tests in radis/search/tests/test_query_parser.py for these utility functions.


4. No Tests for Language Utilities

The changes to code_to_language() lack test coverage. This function is critical for search functionality.

Recommendation: Create radis/pgsearch/tests/test_language_utils.py with tests covering:

  • Known languages (en, de)
  • Case insensitivity
  • Locale variants (en-US, en_GB)
  • Unknown language fallback
  • Empty string handling
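Such a test module could start out like the sketch below; the LANGUAGES mapping here is a small stand-in mirroring the code_to_language() snippet quoted above, since the real table lives in language_utils:

```python
# Stand-in mirroring the code_to_language() snippet quoted above; the real
# LANGUAGES mapping is larger and lives in radis/pgsearch/utils/language_utils.py.
LANGUAGES = {"en": "english", "de": "german"}

def code_to_language(code: str) -> str:
    if not code:
        return "simple"
    normalized = code.lower()
    base = normalized.split("-", 1)[0].split("_", 1)[0]
    return LANGUAGES.get(base, "simple")

assert code_to_language("en") == "english"     # known language
assert code_to_language("DE") == "german"      # case insensitivity
assert code_to_language("en-US") == "english"  # locale variant
assert code_to_language("en_GB") == "english"  # underscore variant
assert code_to_language("xx") == "simple"      # unknown language fallback
assert code_to_language("") == "simple"        # empty string handling
```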

Security

No security concerns identified. The changes:

  • Don't introduce SQL injection risks (character filtering happens before database queries)
  • Don't expose sensitive information
  • Don't change authentication/authorization logic

Performance

No performance concerns. The changes:

  • Replace pyparsing character sets with lightweight unicodedata lookups (likely faster)
  • Maintain O(n) complexity for character validation
  • Don't add database queries or expensive operations

Testing Recommendations

  1. Run existing tests to ensure no regressions:

    uv run cli test -- radis/search/tests/test_query_parser.py
    uv run cli test -- radis/pgsearch/
  2. Manual testing with various Unicode inputs:

    • German: Hämatom Magen-Darm-Trakt
    • Turkish: krüşk characters
    • Mixed: CT Hämatom patient_123
  3. Test language dropdown behavior with an unsupported language code in the database.


Final Recommendation

APPROVE with suggestions for follow-up improvements.

The PR successfully achieves its goals:

  • Makes query parsing Unicode-aware
  • Makes language dropdowns resilient to unknown codes
  • Includes test coverage for Turkish characters

The suggestions above are enhancements rather than blocking issues:

  • Add logging for unknown language codes (helps debugging)
  • Add unit tests for the new helper functions (improves maintainability)
  • Add tests for code_to_language() (prevents regressions)

These can be addressed in this PR or in a follow-up, depending on your timeline.


Nitpick

In radis/pgsearch/providers.py:88-91, code_to_language() is called twice with the same argument:

language = code_to_language(search.filters.language)
query = SearchQuery(query_str, search_type="raw", config=language)
filter_query = _build_filter_query(search.filters)
language = code_to_language(search.filters.language)  # Duplicate

The second assignment on line 91 is redundant (though harmless). Consider removing it for clarity.


Great work on improving internationalization support!


claude Bot commented Jan 20, 2026

Code Review for PR #183: Improve Language Support for Search

I've reviewed the changes to improve Unicode support in search queries and enhance language handling. Overall, this is a well-implemented feature with good test coverage. Here are my findings:

✅ Strengths

  1. Excellent Unicode Support: The query parser now properly handles non-ASCII characters using unicodedata.category() checks. This is the correct approach for international character support.

  2. Comprehensive Test Coverage:

    • New tests in test_language_utils.py cover edge cases (Turkish, diacritics, locale variants, three-letter codes)
    • Tests for Unicode characters in test_query_parser.py (Turkish krüşk, combining characters)
    • Database error handling tested with mock
  3. Smart Language Matching: The code_to_language() function is robust with:

    • Case-insensitive matching
    • Locale variant handling (en-US, en_GB)
    • Fallback to pycountry for comprehensive language database
    • Graceful degradation to 'simple' config
  4. Good Documentation: The Maintenance.md addition explains the search config caching behavior, which is helpful for operators.

  5. Defensive Coding: Forms now use .get(language.code, language.code) to handle missing language labels gracefully.

🔍 Potential Issues

1. Cache Invalidation Strategy (radis/pgsearch/utils/language_utils.py:12)

The @lru_cache(maxsize=1) on get_available_search_configs() is persistent for the process lifetime. If PostgreSQL configs change dynamically (admin installs new dictionaries), the cache won't update until restart.

Consideration: Is this acceptable? The documentation says to restart RADIS, but consider:

  • Adding a management command to clear the cache
  • Time-based cache expiry (using ttl_cache from cachetools)
  • Or document this limitation more prominently in security/operations docs

2. Performance: Repeated Database Queries (radis/pgsearch/providers.py:88-140)

Every search call executes code_to_language(search.filters.language) which calls the cached get_available_search_configs(). While cached, the pycountry lookup and string processing happens on every search.

Suggestion: Consider caching code_to_language() results with @lru_cache(maxsize=128):

@lru_cache(maxsize=128)
def code_to_language(code: str) -> str:
    # ... existing implementation

This would cache the mapping for commonly used language codes.
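The memoization effect is easy to demonstrate with a toy resolver; the function below is a stand-in for illustration, not the real code_to_language():

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def resolve(code: str) -> str:
    # Stand-in for the expensive part (pycountry lookup, string processing).
    return code.lower().split("-", 1)[0]

resolve("en-US")
resolve("en-US")            # second call is served from the cache
print(resolve.cache_info()) # reports one hit, one miss
```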

3. Unicode Normalization Form (radis/search/utils/query_parser.py:61-65)

The code checks Unicode categories but doesn't normalize the query string itself. Consider if users paste text with different Unicode forms (NFC vs NFD).

Example: "café" could be:

  • NFC: caf\u00e9 (4 code points, with a composed "é")
  • NFD: cafe\u0301 (5 code points, base "e" plus a combining accent)

Recommendation: Add normalization in _replace_invalid_characters or at parse entry:

def parse(self, query: str) -> tuple[QueryNode | None, list[str]]:
    query = unicodedata.normalize('NFC', query)  # Normalize early
    # ... rest of method
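The two forms are easy to tell apart programmatically, which shows why normalizing early matters:

```python
import unicodedata

nfd = "cafe\u0301"                        # "café" as base letter + combining accent
nfc = unicodedata.normalize("NFC", nfd)   # composed form "café"

print(len(nfd), len(nfc))   # 5 4
print(nfd == nfc)           # False, despite identical rendering
```

Normalizing both sides to the same form before comparison or indexing makes the strings compare equal.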

4. Test Gap: Special Unicode Categories (radis/search/tests/test_query_parser.py:149-163)

Tests cover basic cases but miss some edge cases:

  • Emoji (category 'So') - are these intentionally excluded?
  • Currency symbols (category 'Sc')
  • Mathematical symbols

If these should be excluded, the tests correctly validate behavior. If they should be allowed, is_search_token_char needs adjustment.

5. Error Recovery Silent Failure (radis/pgsearch/utils/language_utils.py:18-20)

Database errors return empty set, falling back to 'simple' config. This means:

  • Legitimate language codes get silently downgraded
  • No monitoring/alerting of configuration issues

Suggestion: Log at warning level instead of debug, or raise an exception during startup checks:

except DatabaseError as exc:
    logger.warning("Failed to read pg_ts_config; falling back to simple. %s", exc)
    return set()

6. Duplicate Variable Assignment Removed (radis/pgsearch/providers.py:88-91)

Good catch removing duplicate language = code_to_language(...) assignments! Clean refactoring.

🔐 Security Considerations

  1. SQL Injection: The direct SQL in get_available_search_configs() queries pg_ts_config (system catalog). This is safe as it's not parameterized with user input.

  2. Input Sanitization: The sanitize_term() function properly filters characters using is_search_token_char(). This prevents injection of PostgreSQL tsquery operators.
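The character-level filtering idea can be sketched as below; this is a hypothetical reconstruction, not the exact sanitize_term() from the PR, and the whitespace handling is an assumption:

```python
import unicodedata

def is_search_token_char(char: str) -> bool:
    # Rule quoted earlier in the review: letters, numbers, marks allowed.
    return unicodedata.category(char)[0] in ("L", "N", "M")

def sanitize_term(term: str) -> str:
    # Replace disallowed characters (incl. tsquery operators like & | !)
    # with spaces, then collapse runs of whitespace.
    cleaned = "".join(c if is_search_token_char(c) else " " for c in term)
    return " ".join(cleaned.split())

print(sanitize_term("Hämatom & (CT)"))  # Hämatom CT
```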

🚀 Performance Considerations

  1. Caching works well: The LRU cache on get_available_search_configs() prevents repeated DB queries.

  2. Unicode category checks: unicodedata.category() is O(1) and efficient.

  3. Minor optimization opportunity: The seen set in code_to_language() (line 66-70) is a good optimization to avoid duplicate lookups.
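The seen-set idiom referenced in point 3 looks roughly like this (the names here are illustrative, not taken from the PR):

```python
def dedup_preserving_order(candidates: list[str]) -> list[str]:
    # "seen" avoids duplicate lookups while keeping first-seen order.
    seen: set[str] = set()
    unique = []
    for cand in candidates:
        if cand not in seen:
            seen.add(cand)
            unique.append(cand)
    return unique

print(dedup_preserving_order(["english", "en", "english", "eng"]))
# ['english', 'en', 'eng']
```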

📝 Style & Best Practices

  1. Follows Google Python Style Guide: Line length ≤100 chars, type hints present.

  2. Good separation of concerns: Character validation moved to reusable helpers.

  3. Test quality: Uses pytest fixtures (monkeypatch) appropriately, tests are isolated.

🎯 Recommendations Summary

High Priority:

  1. Consider caching code_to_language() results for performance
  2. Add Unicode normalization (NFC) at query parse entry

Medium Priority:
3. Upgrade database error log level from debug to warning
4. Clarify emoji/special symbol handling in tests

Low Priority:
5. Add cache invalidation mechanism or management command for config updates
6. Document the cache behavior in operations runbook

✨ Conclusion

This is a solid PR that meaningfully improves international language support. The code is well-tested, follows project conventions, and handles edge cases thoughtfully. The recommendations above are primarily optimizations rather than blockers.

Great work on the Unicode-aware implementation and comprehensive test coverage! 🎉


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@radis/pgsearch/utils/language_utils.py`:
- Around line 12-20: The current `@lru_cache` on get_available_search_configs
causes a transient DatabaseError to cache an empty set; replace the decorator
with a manual module-level cache so only successful reads are stored: add a
module variable (e.g., _cached_search_configs: set[str] | None = None), have
get_available_search_configs return the cache if set, otherwise query inside the
try, assign the lowercased result to _cached_search_configs on success and
return it, and in the except block log and return an empty set without setting
the cache; reference get_available_search_configs and the except/try block when
making the change.

Comment on lines +12 to +20

@lru_cache(maxsize=1)
def get_available_search_configs() -> set[str]:
    try:
        with connection.cursor() as cursor:
            cursor.execute("SELECT cfgname FROM pg_ts_config")
            return {row[0].lower() for row in cursor.fetchall()}
    except DatabaseError as exc:
        logger.debug("Failed to read pg_ts_config; falling back to simple. %s", exc)
        return set()

⚠️ Potential issue | 🟠 Major

Don’t cache failed pg_ts_config reads.

With @lru_cache, a transient DatabaseError will cache an empty set, forcing a permanent fallback to simple until restart. Cache only successful reads.

🧯 Suggested fix
-@lru_cache(maxsize=1)
-def get_available_search_configs() -> set[str]:
-    try:
-        with connection.cursor() as cursor:
-            cursor.execute("SELECT cfgname FROM pg_ts_config")
-            return {row[0].lower() for row in cursor.fetchall()}
-    except DatabaseError as exc:
-        logger.debug("Failed to read pg_ts_config; falling back to simple. %s", exc)
-        return set()
+@lru_cache(maxsize=1)
+def _get_available_search_configs_cached() -> set[str]:
+    with connection.cursor() as cursor:
+        cursor.execute("SELECT cfgname FROM pg_ts_config")
+        return {row[0].lower() for row in cursor.fetchall()}
+
+
+def get_available_search_configs() -> set[str]:
+    try:
+        return _get_available_search_configs_cached()
+    except DatabaseError as exc:
+        logger.debug("Failed to read pg_ts_config; falling back to simple. %s", exc)
+        _get_available_search_configs_cached.cache_clear()
+        return set()


claude Bot commented Jan 22, 2026

Pull Request Review: Improve Language Support for Search

This PR enhances RADIS search with Unicode-aware query parsing and dynamic language detection from PostgreSQL.

Strengths

  1. Unicode Character Support - Properly handles combining characters and uses unicodedata.category() for letter/number/mark detection
  2. Dynamic Language Detection - Queries pg_ts_config at runtime with LRU caching and pycountry for ISO 639 resolution
  3. Graceful Error Handling - Falls back to simple config with proper logging
  4. Code Cleanup - Removed duplicate code_to_language() calls in providers.py

Critical Issues

1. ReDoS Vulnerability (HIGH PRIORITY)

Location: radis/pgsearch/utils/language_utils.py:33

The regex pattern [^A-Za-z0-9]+ can cause catastrophic backtracking with alternating valid/invalid characters.

Recommendation: Replace with non-backtracking approach by removing the + quantifier or using string iteration.

2. Language Config Validation (MEDIUM PRIORITY)

Location: radis/pgsearch/providers.py:89,104,132

While code_to_language() validates against pg_ts_config, add explicit format validation to ensure language only contains safe characters.

3. Cache Invalidation Missing (MEDIUM PRIORITY)

The @lru_cache never refreshes when PostgreSQL configs change. Consider adding a management command to clear the cache.

Other Recommendations

Should Fix:

  • Add integration test with actual non-ASCII search queries
  • Add explicit input validation regex for language codes

Nice to Have:

  • Add PostgreSQL text search config installation examples to docs
  • Add test cases for language names with parentheses
  • Improve test naming (test_code_to_language_continents)

Overall Assessment

Grade: B+ (Good with reservations)

Solid internationalization improvement with good test coverage. The dynamic language detection is elegant and Unicode handling is robust. However, the ReDoS vulnerability must be fixed before merging.

Once the security issue is addressed and basic integration testing is added, this will be ready to merge.

Great work!


claude Bot commented Jan 22, 2026

PR Review: Improve Language Support for Search

Thank you for this comprehensive improvement to language support! This is a well-structured PR that addresses several important issues. Here's my detailed feedback:

Strengths

  1. Unicode-aware search: Excellent improvement making query parsing Unicode-aware by using unicodedata.category() instead of pyparsing's limited character sets. This properly handles Turkish characters like 'ş' and combining diacritics.

  2. Dynamic language resolution: Smart approach querying pg_ts_config at runtime instead of hardcoding a language map. This makes the system flexible and adaptable to different PostgreSQL installations.

  3. Comprehensive test coverage: The new test suite for language_utils.py is thorough, covering edge cases like:

    • Case sensitivity
    • Locale variants (en-US, en_GB)
    • Three-letter ISO codes
    • SQL injection attempts
    • Database errors
    • Multi-word language names
  4. Graceful error handling: Proper fallback to 'simple' config when languages aren't found, with appropriate logging.

  5. Security considerations: Good SQL injection protection with _is_safe_language_code() validation.

  6. Resilient form dropdowns: Using .get(language.code, language.code) prevents KeyError when unknown language codes exist in the database.

🔍 Issues & Concerns

1. Cache Invalidation Issue (Critical)

def clear_search_config_cache() -> None:
    get_available_search_configs.cache_clear()

This function calls .cache_clear() on the wrong function! It should call:

def clear_search_config_cache() -> None:
    _get_available_search_configs_cached.cache_clear()

The actual cached function is _get_available_search_configs_cached, not get_available_search_configs. This means the management command won't actually clear the cache.

Location: radis/pgsearch/utils/language_utils.py:68-69

2. SQL Injection Risk in pg_ts_config Query (High Priority)

cursor.execute("SELECT cfgname FROM pg_ts_config")

While the current query is safe, it's vulnerable if this code is ever refactored to accept parameters. Consider using parameterized queries or adding a security comment explaining why this is safe.

Location: radis/pgsearch/utils/language_utils.py:14

3. Duplicate Language Resolution Logic

In providers.py, lines 88, 131, and 140 all call code_to_language(). The duplicate assignment on line 88 was correctly removed, but the pattern repeats in all three functions (search, count, retrieve). Consider extracting this to a shared helper or documenting why it can't be cached per-request.

Location: radis/pgsearch/providers.py:88, 131, 140

4. Missing Documentation on pycountry Dependency

The new dependency on pycountry significantly changes how language resolution works, but there's no inline documentation explaining:

  • Why pycountry was chosen over alternatives
  • What happens when pycountry data is outdated
  • Performance implications of the lookup

Consider adding docstrings to _language_name_candidates() explaining this design decision.

Location: radis/pgsearch/utils/language_utils.py:42-61

5. Query Parser Regex Too Permissive

word = ~(not_ | and_ | or_) + pp.Regex(r"[^\s()]+").set_parse_action(

The regex [^\s()]+ matches ANY non-whitespace, non-parenthesis character, but then is_search_query_char() filters them out later. This creates unnecessary work. Consider making the regex match the actual allowed characters:

word = ~(not_ | and_ | or_) + pp.Regex(r"[\w\-'\p{L}\p{M}]+", re.UNICODE).set_parse_action(

Though pyparsing's Unicode support may be limited, so the current approach might be pragmatic.

Location: radis/search/utils/query_parser.py:196

🎯 Suggestions for Improvement

  1. Add performance metrics: The @lru_cache(maxsize=1) on _get_available_search_configs_cached() is appropriate, but consider logging cache hits/misses for monitoring.

  2. Test the management command: Add an integration test verifying refresh_search_configs actually clears the cache (this would catch issue 1 above).

  3. Document Unicode normalization: The NFKD normalization in _normalize_language_name() is subtle. Add a comment explaining why NFKD is chosen over NFC/NFD/NFKC.

  4. Consider lazy loading: The current implementation queries pg_ts_config on first search. For high-traffic systems, consider warming this cache during app startup.

  5. Add migration notes: The PR description mentions reindexing reports after installing new configs, but this should also be documented in the code or migrations.
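As an illustration of why NFKD is a common choice for name matching (the actual _normalize_language_name() may differ), NFKD decomposes accented letters into a base letter plus combining marks, which can then be stripped:

```python
import unicodedata

def strip_marks(name: str) -> str:
    # NFKD splits "ü" into "u" + U+0308; combining marks are then dropped.
    decomposed = unicodedata.normalize("NFKD", name)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

print(strip_marks("Türkçe"))    # Turkce
print(strip_marks("français"))  # francais
```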

📋 Code Quality Assessment

  • Style: ✅ Follows Google Python Style Guide and project conventions
  • Line Length: ✅ Adheres to 100-character limit
  • Type Hints: ✅ Proper type annotations throughout
  • Error Handling: ✅ Appropriate try/except with logging
  • Test Coverage: ✅ Comprehensive tests for new functionality

🔒 Security Review

  • ✅ SQL injection protection via _is_safe_language_code()
  • ✅ Input validation on language codes
  • ✅ Proper error handling prevents information leakage
  • ⚠️ pg_ts_config query could use explicit security documentation

Performance Considerations

  • ✅ LRU cache minimizes database queries
  • ✅ Early return on empty/invalid codes
  • ✅ Efficient set operations for config lookup
  • 💡 Consider profiling pycountry.languages.lookup() under load

🧪 Testing

The test coverage is excellent, but consider adding:

  1. Integration test for the management command
  2. Performance test for code_to_language() with high concurrency
  3. Test for combining marks in search queries (like your café example)

📝 Documentation

The new docs/Maintenance.md section is helpful but could be expanded:

  • How to verify which configs are available
  • Example of installing new PostgreSQL text search dictionaries
  • Impact on search performance for different languages

🎉 Verdict

This is a solid improvement that significantly enhances language support. The critical cache clearing bug needs to be fixed before merging, but otherwise this is well-architected and thoroughly tested code.

Recommendation: Request changes for the cache invalidation bug, then approve after fix.


Critical Issues: 1
High Priority Issues: 1
Suggestions: 5
Lines Changed: +319/-26

…ation and add tests for non ASCII characters

@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@radis/pgsearch/utils/language_utils.py`:
- Around line 68-69: The clear_search_config_cache function calls
get_available_search_configs.cache_clear(), but the lru_cache is applied to
_get_available_search_configs_cached, so calling cache_clear on the wrong symbol
raises AttributeError; update clear_search_config_cache to call
_get_available_search_configs_cached.cache_clear() (or alternatively move the
`@lru_cache` decorator to get_available_search_configs) so the cache_clear call
targets the actual cached function (_get_available_search_configs_cached).

Comment thread radis/pgsearch/utils/language_utils.py Outdated
@claude

claude Bot commented Jan 23, 2026

Pull Request Review: Improve Language Support for Search

Summary

This PR makes significant improvements to search functionality by adding Unicode-aware query parsing and dynamic PostgreSQL text search config resolution. The changes address important internationalization needs for non-ASCII languages.

Positive Aspects ✅

1. Excellent Test Coverage

  • Comprehensive tests for language resolution (141 lines in test_language_utils.py)
  • Tests cover edge cases: SQL injection attempts (en;DROP), diacritics, multi-word names, 3-letter codes
  • Good coverage for Unicode token helpers in query parser tests
  • Database error handling is well-tested with monkeypatching

2. Security-Conscious Implementation

  • _is_safe_language_code() validates input before use in queries (line 68-69 in language_utils.py)
  • Static SQL query with comment explaining no user input (line 14-15 in language_utils.py)
  • Proper sanitization prevents SQL injection attacks

3. Robust Fallback Mechanism

  • Graceful degradation to simple config when language not found
  • Database errors handled without crashing (returns empty set)
  • Missing language labels fall back to language code in forms

4. Unicode-Aware Character Handling

  • Uses unicodedata.category() for proper Unicode support (line 64-65 in query_parser.py)
  • Handles combining marks (Mark category) correctly
  • NFKD normalization for consistent diacritic handling (line 33 in language_utils.py)

Issues & Recommendations 🔍

1. Cache Invalidation Concern ⚠️

Location: radis/pgsearch/utils/language_utils.py:11-16

The @lru_cache(maxsize=1) decorator caches PostgreSQL configs for the lifetime of the process. While this is good for performance, there's a potential issue:

@lru_cache(maxsize=1)
def _get_available_search_configs_cached() -> set[str]:
    with connection.cursor() as cursor:
        cursor.execute("SELECT cfgname FROM pg_ts_config")
        return {row[0].lower() for row in cursor.fetchall()}

Problem: If new PostgreSQL text search configs are installed at runtime (which the maintenance docs suggest is possible), the cache won't be refreshed until the application restarts.

Recommendation: The refresh_search_configs management command exists but may not be discoverable. Consider:

  • Adding a comment near the cache explaining when/why to clear it
  • Documenting the command in CLAUDE.md or making it more visible
  • Consider TTL-based caching (e.g., with functools.lru_cache + a timestamp check) for long-running processes

Documentation Update: The Maintenance.md note is good, but consider adding it to troubleshooting sections too.

2. Duplicate Language Resolution 🔄

Location: radis/pgsearch/providers.py:89-96, 133-140, 145-152

The _resolve_language() helper was added to reduce duplication, which is excellent. However, I notice it's called three times in very similar patterns:

language = _resolve_language(search.filters)
query = SearchQuery(query_str, search_type="raw", config=language)
filter_query = _build_filter_query(search.filters)

This is clean! The only minor suggestion: consider extracting the common pattern of building both query and filter to reduce the three nearly-identical function bodies.

3. Missing Type Hint 📝

Location: radis/search/utils/query_parser.py:180

def _delete_spaces_at_start_of_parens(self, input_string):  # Missing return type
    return self._modify_unquoted_segments(input_string, lambda s: re.sub(r"\(\s*", "(", s))

Should be: def _delete_spaces_at_start_of_parens(self, input_string: str) -> str:

4. Performance Consideration 💡

Location: radis/pgsearch/utils/language_utils.py:45-65

The _language_name_candidates() function calls pycountry.languages.get() and potentially pycountry.languages.lookup() on every language code resolution. For frequently-used language codes (en, de, etc.), this could be cached at the module level.

Suggestion:

@lru_cache(maxsize=128)  # Cache language code -> candidates mapping
def _language_name_candidates(code: str) -> list[str]:
    # ... existing implementation

This would benefit search-heavy workloads where the same language codes are resolved repeatedly.

5. Regex Pattern Broadness 🎯

Location: radis/search/utils/query_parser.py:197

word = ~(not_ | and_ | or_) + pp.Regex(r"[^\s()]+").set_parse_action(
    lambda t: TermNode("WORD", t[0])  # type: ignore
)

The comment says "Input is pre-sanitized; keep regex broad to avoid pyparsing Unicode limitations." This is reasonable, but the regex [^\s()]+ will match ANY non-whitespace, non-paren character, including potentially invalid ones that slipped through sanitization.

Recommendation: Add a defensive assertion or comment explaining why this is safe (because _replace_invalid_characters runs first in the parse() method).

6. Test Isolation 🧪

Location: radis/pgsearch/tests/test_language_utils.py:111, 137

Tests manually call language_utils._get_available_search_configs_cached.cache_clear(). This works but could be fragile if test order matters.

Recommendation: Consider a pytest fixture with autouse to clear the cache before each test:

@pytest.fixture(autouse=True)
def clear_config_cache():
    language_utils._get_available_search_configs_cached.cache_clear()
    yield

7. Minor: Empty init Method 🧹

Location: radis/search/utils/query_parser.py:75-76

def __init__(self):
    pass

This can be removed entirely since it does nothing.

Best Practices Followed 🌟

  1. Defensive programming with SQL injection prevention
  2. Comprehensive logging at appropriate levels (warning for unknown languages, debug for invalid codes)
  3. Backward compatibility with fallback to simple config
  4. Clear separation of concerns (language resolution, normalization, validation)
  5. Good test structure with descriptive test names
  6. Documentation updates alongside code changes

Security Assessment 🔒

No security vulnerabilities found. The PR properly:

  • Validates language codes before use
  • Uses static SQL queries
  • Sanitizes user input in query parser
  • Handles edge cases (empty strings, special chars)

The test test_code_to_language_invalid_chars specifically validates SQL injection prevention.

Performance Impact 📊

Positive:

  • Caching of PostgreSQL configs reduces DB queries
  • Language resolution is efficient with early returns

Neutral:

  • pycountry lookups add minimal overhead (could be cached as noted above)
  • Unicode normalization has acceptable performance for typical query lengths

Suggested Improvements (Optional)

  1. Add @lru_cache to _language_name_candidates() for repeated lookups
  2. Consider adding a health check endpoint that validates search config availability
  3. Add integration tests that verify end-to-end search with Turkish/non-ASCII queries
  4. Document the search language resolution algorithm in CLAUDE.md

Conclusion

This is a high-quality PR that significantly improves internationalization support. The implementation is secure, well-tested, and follows Django/Python best practices. The issues identified are minor and mostly relate to potential optimizations or documentation improvements.

Recommendation: Approve with minor suggestions for cache optimization and documentation enhancements.

Great work on the comprehensive test coverage and security-conscious implementation! 🎉

@claude

claude Bot commented Jan 26, 2026

Code Review: Improve Language Support for Search

This PR makes meaningful improvements to Unicode handling and language support in the search system. The changes are well-structured with good test coverage. Here are my findings:

Strengths

1. Unicode-Aware Query Parsing

The shift from pyparsing's limited character sets to Unicode category checking is excellent:

def is_search_token_char(char: str) -> bool:
    if char in SAFE_TERM_CHARS:
        return True
    category = unicodedata.category(char)
    return category[0] in ('L', 'N', 'M')

This properly handles Turkish characters (ş, ğ, ı), combining diacritics, and other non-ASCII scripts.
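
The category test can be verified directly; a quick standalone illustration of why the 'L'/'N'/'M' prefixes admit exactly these characters:

```python
import unicodedata

# General-category prefixes: L* = letters, N* = numbers, M* = combining marks
assert all(unicodedata.category(c)[0] == "L" for c in "şğıé")
assert unicodedata.category("3")[0] == "N"
assert unicodedata.category("\u0308")[0] == "M"  # combining diaeresis
assert unicodedata.category("$")[0] == "S"       # symbols fall outside and are rejected
```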

2. Dynamic Language Configuration

The runtime resolution of PostgreSQL text search configs is a significant improvement over hardcoded mappings:

  • Uses pycountry for ISO code → language name mapping
  • Queries pg_ts_config to discover available configs
  • Proper fallback to 'simple' when language not found

3. Security Hardening

Good SQL injection prevention in _is_safe_language_code() and safe database query in line 15 of language_utils.py.

4. Excellent Test Coverage

The test suite covers edge cases well:

  • Turkish characters (test_query_parser.py:40)
  • Combining diacritics (test_query_parser.py:41)
  • Invalid language codes with SQL injection attempts (test_language_utils.py:61-63)
  • Database errors (test_language_utils.py:103-141)

5. Graceful Degradation

Forms now use .get() with fallback to language code when labels are missing (radis/search/forms.py:63, etc.)
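
The fallback pattern is simply dict.get with the code itself as the default; the LANGUAGE_LABELS contents below are illustrative:

```python
LANGUAGE_LABELS = {"en": "English", "de": "German"}


def label_for(code: str) -> str:
    # unknown codes render as the raw code instead of raising KeyError
    return LANGUAGE_LABELS.get(code, code)


print(label_for("de"))  # German
print(label_for("tr"))  # tr
```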


Issues & Recommendations

1. Cache Invalidation Risk ⚠️ MEDIUM

Issue: The LRU cache for search configs (@lru_cache(maxsize=1)) persists across the application lifetime. If PostgreSQL configs change (e.g., new language dictionaries installed), the cache won't refresh until restart.

Location: radis/pgsearch/utils/language_utils.py:11-16

Current mitigation: Documentation mentions restart requirement (docs/Maintenance.md), and there's a management command.

Recommendation: Consider adding automatic cache expiration:

import logging
import time
from functools import lru_cache

from django.db import DatabaseError, connection

logger = logging.getLogger(__name__)

_cache_timestamp = None
CACHE_TTL = 3600  # 1 hour

@lru_cache(maxsize=1)
def _get_available_search_configs_cached(cache_key: int) -> set[str]:
    # cache_key is unused in the body; it exists only so a new TTL window
    # produces a cache miss even if cache_clear() was skipped
    with connection.cursor() as cursor:
        cursor.execute("SELECT cfgname FROM pg_ts_config")
        return {row[0].lower() for row in cursor.fetchall()}

def get_available_search_configs() -> set[str]:
    global _cache_timestamp
    now = time.time()
    if _cache_timestamp is None or (now - _cache_timestamp) > CACHE_TTL:
        _cache_timestamp = now
        _get_available_search_configs_cached.cache_clear()

    try:
        return _get_available_search_configs_cached(int(now / CACHE_TTL))
    except DatabaseError as exc:
        logger.warning("Failed to read pg_ts_config; falling back to simple. %s", exc)
        return set()

Alternative: Document that the management command should be run after config changes.


2. Performance: Redundant Database Queries ⚠️ MEDIUM

Issue: code_to_language() is called for every search operation but queries the database on first call. On high-traffic systems, the initial query could cause a spike.

Location: radis/pgsearch/providers.py:28-29, called from lines 92, 135, 144

Recommendation: Consider warming the cache at application startup (Django AppConfig.ready() method):

# radis/pgsearch/apps.py
from django.apps import AppConfig

class PgsearchConfig(AppConfig):
    name = 'radis.pgsearch'
    
    def ready(self):
        # Warm up the language config cache
        from .utils.language_utils import get_available_search_configs
        try:
            get_available_search_configs()
        except Exception:
            pass  # Will retry on first actual use

3. Code Duplication 📝 MINOR

Issue: The pattern of resolving language and building query appears multiple times in providers.py.

Location: radis/pgsearch/providers.py:90-94, 133-138, 142-146

Current state: Already extracted _resolve_language() helper (line 28-29) ✅

Suggestion: Consider extracting the common pattern further:

def _create_search_query(filters: SearchFilters, query_str: str) -> SearchQuery:
    language = _resolve_language(filters)
    return SearchQuery(query_str, search_type="raw", config=language)

This is a minor improvement and not critical.


4. Regex Simplification in Query Parser 📝 MINOR

Issue: The comment on line 196-197 mentions "avoid pyparsing Unicode limitations" but the regex is now r"[^\s()]+" which is very permissive.

Location: radis/search/utils/query_parser.py:197

Concern: This regex will match ANY non-whitespace, non-paren character, including potentially problematic ones that were sanitized earlier. This relies on the sanitization step happening first.

Recommendation: Add a test to ensure sanitization happens before pyparsing:

def test_sanitization_before_parsing():
    # Ensure invalid chars are removed before pyparsing sees them
    parser = QueryParser()
    node, fixes = parser.parse("test$invalid@chars")
    assert "Fixed invalid characters" in fixes
    assert node is not None

Status: This is already tested implicitly in test_query_parser.py:131-133 ✅


5. Missing Edge Case Tests 📝 MINOR

Missing test cases:

  1. What happens with empty string after normalization in _normalize_language_name()?
    • Already handled (lines 30-31, 37-38) ✅
  2. What if pg_ts_config returns duplicate config names with different cases?
    • Handled by .lower() in line 16 ✅
  3. What if a language code is exactly "simple"?
    • Works correctly as it's checked against configs

Recommendation: Add test for "simple" language code:

def test_code_to_language_simple_passthrough(monkeypatch):
    set_configs(monkeypatch, {"english", "simple"})
    assert code_to_language("simple") == "simple"

6. Documentation Clarity 📝 MINOR

Issue: The maintenance documentation could be more explicit about when to run the refresh command.

Location: docs/Maintenance.md:21-25

Current text:

"restart RADIS to refresh the config cache"

Recommendation: Clarify the alternative:

## Search language configs

RADIS reads available text search configs from Postgres (`pg_ts_config`) and auto-maps
language codes to matching configs (falling back to `simple`). If new dictionaries/configs
are installed in Postgres, you have two options:
1. Restart the RADIS application to refresh the config cache
2. Run `uv run cli manage refresh_search_configs` to clear the cache without restart

After refreshing the cache, reindex reports to apply the new config to existing data.

7. Type Safety 📝 MINOR

Issue: The code_to_language() function always returns a string, but callers might not know that empty input returns "simple".

Location: radis/pgsearch/utils/language_utils.py:76-97

Recommendation: Add a docstring:

def code_to_language(code: str) -> str:
    """
    Map a language code to a PostgreSQL text search configuration name.
    
    Args:
        code: ISO 639-1/639-3 language code (e.g., 'en', 'de', 'tr')
    
    Returns:
        PostgreSQL config name (e.g., 'english', 'german', 'turkish').
        Falls back to 'simple' if the code is empty, invalid, or unsupported.
    """

Security Assessment ✅

SQL Injection: Well protected

  • Line 15: Static SQL query with no user input ✅
  • Line 68-69: Input validation prevents injection in language codes ✅
  • Test coverage for injection attempts (test_language_utils.py:61-63) ✅

ReDoS (Regular Expression Denial of Service): Not a concern

  • The regex patterns are simple and bounded ✅

Performance Considerations

  1. Database Query Caching: ✅ LRU cache prevents repeated queries
  2. Unicode Normalization: NFKD normalization is fast for typical language names
  3. pycountry Lookups: Library uses internal caching, should be fast

Overall Assessment

This is a well-executed PR that meaningfully improves internationalization support. The code quality is high, with good separation of concerns, comprehensive tests, and defensive programming.

Approval Recommendation: ✅ APPROVE with minor suggestions

The issues identified are minor and don't block merging. Consider addressing the cache invalidation strategy and adding startup cache warming as follow-up improvements.

Priority of Recommendations:

  1. HIGH: Add cache warming at startup (prevents first-request latency)
  2. MEDIUM: Consider TTL-based cache expiration (operational improvement)
  3. LOW: Add docstrings and minor test cases (code quality)

Great work on improving Unicode support! The Turkish character handling and dynamic language resolution are significant improvements for international deployments.

@claude

claude Bot commented Jan 27, 2026

Pull Request Review: Improve Language Support for Search

This PR makes important improvements to language handling in the search functionality. Overall, the implementation is well-thought-out with good test coverage. Below are my detailed findings:

✅ Strengths

  1. Comprehensive Test Coverage: Excellent test suite in test_language_utils.py covering edge cases including:

    • Case-insensitive matching
    • Locale variants (en-US, en_GB)
    • Three-letter ISO codes
    • Database error handling
    • SQL injection prevention (en;DROP)
  2. Security Improvements:

    • Input validation in _is_safe_language_code() prevents SQL injection
    • The SQL query in _get_available_search_configs_cached() is properly static with no user input (line 15)
  3. Unicode Support:

    • Proper Unicode normalization using NFKD decomposition (line 33 in language_utils.py)
    • Unicode-aware character classification using unicodedata.category() (lines 64-65 in query_parser.py)
    • Handles combining characters correctly (Mark category 'M')
  4. Graceful Degradation: Falls back to 'simple' config when:

    • Language code is empty or invalid
    • Database query fails
    • No matching config found
  5. Caching Strategy: LRU cache with maxsize=1 is appropriate since configs rarely change, and the clear_search_config_cache() function allows manual refresh.

  6. Documentation: New maintenance documentation clearly explains the need to restart RADIS after installing new Postgres configs.

🔍 Issues & Concerns

Critical Issues

  1. Inconsistent Attribute Access in Bulk Upsert (radis/reports/api/viewsets.py):

    • Line 100: existing.language = language (sets relationship)
    • Line 154: .pk used consistently for modality
    • Line 168: .pk used consistently for group
    • Observation: The PR standardizes on .pk where .id and .pk were previously mixed. While the two are functionally equivalent for most Django models, .pk is more idiomatic because it also works with custom primary key fields, so this change is good.
    • Note: Setting existing.language = language (line 100) instead of existing.language_id = language.id is actually better practice as it properly handles the relationship object.
  2. Regex Pattern in Query Parser (radis/search/utils/query_parser.py:197):

    • Changed from pp.Word(pp.alphanums + pp.alphas8bit + "_-'") to pp.Regex(r"[^\s()]+")
    • Concern: The new pattern is extremely broad - it matches ANY non-whitespace, non-parenthesis character
    • Risk: Could allow special characters that might cause issues downstream if not properly sanitized
    • Mitigation: Input is pre-sanitized by _replace_invalid_characters() which uses is_search_query_char(), so this should be safe
    • Question: Why not use pp.Regex(r"[\w\-']+") or similar to be more explicit? The comment says "keep regex broad to avoid pyparsing Unicode limitations" - this could use more explanation.

Medium Priority Issues

  1. Cache Invalidation Strategy:

    • Cache is only cleared via manual command (refresh_search_configs)
    • If new Postgres text search configs are installed, the app won't detect them until restart or manual cache clear
    • Suggestion: Consider adding a cache TTL or detecting config changes at startup
    • Note: The maintenance docs address this, but it could still cause confusion
  2. Error Handling in code_to_language():

    • Line 84: If get_available_search_configs() returns empty set due to DB error, all language codes will fall back to 'simple'
    • This is logged but silent to the user
    • Consideration: In a production environment with temporary DB issues, all searches would fall back to 'simple' config, potentially degrading search quality
  3. Performance Consideration:

    • _language_name_candidates() (lines 45-65) does string processing and pycountry lookups on every call
    • Called within the search path via _resolve_language()
    • Suggestion: Consider caching language code to config mappings, not just available configs
    • Impact: Probably negligible for typical use, but could matter with high query volume

Minor Issues

  1. Type Safety in sanitize_term() (radis/pgsearch/providers.py:24-25):

    • Function doesn't validate input is a string
    • If term is None or non-string, the iteration will fail
    • Suggestion: Add type validation or use type hints more defensively
  2. Duplicate Code:

    • _resolve_language() is just a one-line wrapper around code_to_language()
    • Used 3 times in providers.py (lines 92, 135, 145)
    • Minor: Could inline this, but the wrapper provides good semantic clarity
  3. Test Coverage Gap:

    • No integration tests verifying that Turkish characters (like "krüşk") actually work end-to-end with PostgreSQL
    • The unit test in test_query_parser.py is good, but doesn't verify Postgres handles it correctly

📝 Code Quality Observations

  1. Line Length: Changes respect the 100-character limit (per CLAUDE.md)

  2. Naming Conventions: Follow Google Python Style Guide well

  3. Comments:

    • Line 33 comment about NFKD is helpful
    • Line 14 comment about static query is good for security audit
    • Line 196 comment could be more detailed about pyparsing limitations
  4. Error Messages: Clear and actionable (e.g., lines 92-95 in language_utils.py)

🎯 Recommendations

  1. Consider Adding:

    • Integration test that actually queries Postgres with Unicode characters
    • Benchmark for _language_name_candidates() performance with high query volume
    • Cache for language code mappings, not just available configs
  2. Documentation:

    • Expand comment on line 196 about pyparsing Unicode limitations
    • Add docstring to is_search_token_char() and is_search_query_char() explaining the character categories
  3. Future Enhancement:

    • Consider automatic cache invalidation on Postgres config changes (via triggers or periodic checks)
    • Add metrics/monitoring for language fallback rate

✨ Overall Assessment

This is a solid improvement to the codebase. The Unicode support is properly implemented, security is taken seriously, and the test coverage is comprehensive. The main concerns are:

  1. The very broad regex pattern (low risk due to pre-sanitization)
  2. Potential performance impact of repeated pycountry lookups (minor)
  3. Cache invalidation UX (documented but manual)

Recommendation: ✅ Approve with minor suggestions

The code is production-ready. The suggestions above are enhancements rather than blockers. Great work on the test coverage and graceful error handling!


Note: I couldn't run the actual tests or check CI status, so please verify all tests pass and linting is clean before merging.

@NumericalAdvantage NumericalAdvantage merged commit ca22984 into main Jan 27, 2026
3 checks passed
@NumericalAdvantage NumericalAdvantage deleted the textSearchLangSupport branch January 27, 2026 14:21