Add grammatical form equivalence checking for cross-linguistic case aliases #362
Introduces langtools.form_equivalences with normalize_grammatical_form() and are_grammatical_forms_equivalent() to handle cases where different terminological conventions describe the same grammatical form. The first equivalence defined is Lithuanian locative = inessive: the traditional Lithuanian grammar term "locative" (vietininkas) names the same case that cross-linguistic typologists and LLMs familiar with Finnish/Estonian/Hungarian often call "inessive". The 0062 sentence-decomposition benchmark scorer now calls are_grammatical_forms_equivalent() inside _grammatical_form_similarity(), so noun/lt_locative_singular and noun/lt_inessive_singular (or any reordering such as noun/lt_singular_inessive) all receive full credit.
https://claude.ai/code/session_01XLApSPLmn4brnG74uriPkQ
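For reviewers, here is a minimal sketch of the behavior described above. It is not the code in this PR. It assumes the role/lang_component_component format and the case → number → gender ordering from the summary; the CASE_ALIASES table layout, the CASES/NUMBERS/GENDERS vocabularies, and the _rank helper are hypothetical and exist only for illustration.

```python
from typing import Dict

# Hypothetical alias table: language code -> {alias: canonical case name}.
CASE_ALIASES: Dict[str, Dict[str, str]] = {"lt": {"inessive": "locative"}}

# Hypothetical vocabularies, used only to sort components into
# case -> number -> gender order.
CASES = {"nominative", "genitive", "dative", "accusative",
         "instrumental", "locative", "vocative"}
NUMBERS = {"singular", "plural"}
GENDERS = {"masculine", "feminine"}


def _rank(component: str) -> int:
    """Sort key: case first, then number, then gender, then anything else."""
    if component in CASES:
        return 0
    if component in NUMBERS:
        return 1
    if component in GENDERS:
        return 2
    return 3


def normalize_grammatical_form(form: str) -> str:
    """Resolve case aliases and reorder components of 'role/lang_a_b'."""
    cleaned = form.strip().lower()
    role, sep, rest = cleaned.partition("/")
    if not sep or "_" not in rest:
        return cleaned  # not in the expected format; leave it alone
    lang, *components = rest.split("_")
    aliases = CASE_ALIASES.get(lang, {})
    resolved = [aliases.get(c, c) for c in components]
    resolved.sort(key=_rank)  # stable, so unknown components keep their order
    return f"{role}/{'_'.join([lang, *resolved])}"


def are_grammatical_forms_equivalent(a: str, b: str) -> bool:
    """Two forms are equivalent when their normalized strings match."""
    return normalize_grammatical_form(a) == normalize_grammatical_form(b)
```

In this sketch the alias lookup is keyed by the language prefix, so an "inessive" label in another language's form string is left untouched.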
@codex for review - is this needed? We could have per-language context to avoid needing to convert output.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 07525d9898
src/langtools/form_equivalences.py
Outdated
import re
from typing import Dict, List, Optional, Tuple

from langtools.lt.case_equivalences import LT_CASE_ALIASES
Avoid importing LT package for static alias map
Importing LT_CASE_ALIASES via langtools.lt.case_equivalences forces Python to execute langtools/lt/__init__.py first, which eagerly imports Wiktionary-related modules (langtools.lt.types and then clients.wiktionary.client) and pulls in extra runtime dependencies unrelated to form comparison. As a result, using langtools.form_equivalences (and now sentence decomposition scoring) can fail at import time in environments that only need benchmark scoring but not the Wiktionary stack; this alias table should live in a dependency-light module outside the langtools.lt package init path.
form_equivalences.py now reads langtools/<lang>/case_equivalences.py
on first use via importlib.util.spec_from_file_location, bypassing
the language package's __init__.py entirely. This means benchmark
scoring environments that only need form comparison never pay the cost
of the Wiktionary/SQLAlchemy stack that langtools.lt.__init__ pulls in.
langtools/lt/case_equivalences.py is re-introduced as a plain,
import-free data file exposing CASE_ALIASES = {"inessive": "locative"}.
Any language can add its own case_equivalences.py without editing
form_equivalences.py.
https://claude.ai/code/session_01XLApSPLmn4brnG74uriPkQ
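The loading approach described in this commit message can be sketched roughly as follows. This is an illustration, not the committed code: load_case_aliases, its package_root parameter, and the synthetic module name are hypothetical. Only the importlib.util calls and the CASE_ALIASES attribute name come from the description above.

```python
import importlib.util
from functools import lru_cache
from pathlib import Path
from types import MappingProxyType
from typing import Mapping


@lru_cache(maxsize=None)
def load_case_aliases(package_root: str, lang: str) -> Mapping[str, str]:
    """Load <package_root>/<lang>/case_equivalences.py by file path.

    Loading by path means the language package's __init__.py is never
    executed, so none of its heavier imports are triggered.
    """
    path = Path(package_root) / lang / "case_equivalences.py"
    if not path.is_file():
        return MappingProxyType({})  # language defines no aliases
    # The module name is synthetic and is not registered in sys.modules.
    spec = importlib.util.spec_from_file_location(f"_case_equivalences_{lang}", path)
    if spec is None or spec.loader is None:
        return MappingProxyType({})
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Read-only view, so callers cannot mutate the cached table.
    return MappingProxyType(dict(getattr(module, "CASE_ALIASES", {})))
```

The lru_cache wrapper gives the "on first use" behavior: each data file is executed at most once per process, and a language without a case_equivalences.py simply yields an empty mapping.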
Summary
This PR introduces a new module for normalizing and comparing grammatical form strings, with support for language-specific case name aliases. This enables the scoring system to recognize when different linguistic terminology refers to the same grammatical form (e.g., Lithuanian locative vs. cross-linguistic inessive).
Key Changes
New module: langtools/form_equivalences.py
- normalize_grammatical_form(): resolves language-specific case aliases and reorders components to a canonical order (case → number → gender)
- are_grammatical_forms_equivalent(): compares two form strings, accounting for aliases and component ordering
- role/lang_component_component format
New module: langtools/lt/case_equivalences.py
Updated: benchmarks/lib/runners/sentence_decomposition_runner.py
- are_grammatical_forms_equivalent() is now called inside the _grammatical_form_similarity() method
Updated: tests/benchmarks/test_sentence_decomposition_scoring.py
- test_0062_lt_inessive_scores_same_as_locative() verifies Lithuanian case equivalence scoring
New test file: tests/langtools/test_form_equivalences.py
Implementation Details
- _LANG_CASE_ALIASES dictionary
https://claude.ai/code/session_01XLApSPLmn4brnG74uriPkQ