Add grammatical form equivalence checking for cross-linguistic case aliases #362

Open
powera wants to merge 2 commits into main from claude/lithuanian-case-support-IQuyQ

@powera (Owner) commented Feb 25, 2026

Summary

This PR introduces a new module for normalizing and comparing grammatical form strings, with support for language-specific case name aliases. This enables the scoring system to recognize when different linguistic terminology refers to the same grammatical form (e.g., Lithuanian locative vs. cross-linguistic inessive).

Key Changes

  • New module: langtools/form_equivalences.py

    • normalize_grammatical_form(): Resolves language-specific case aliases and reorders components to a canonical order (case → number → gender)
    • are_grammatical_forms_equivalent(): Compares two form strings accounting for aliases and component ordering
    • Handles parsing of form strings in the role/lang_component_component format
  • New module: langtools/lt/case_equivalences.py

    • Defines Lithuanian case aliases mapping "inessive" → "locative"
    • Includes documentation explaining the linguistic rationale (Lithuanian locative is functionally equivalent to the cross-linguistic inessive case)
  • Updated: benchmarks/lib/runners/sentence_decomposition_runner.py

    • Integrated are_grammatical_forms_equivalent() into _grammatical_form_similarity() method
    • Now returns full credit (1.0) when forms are equivalent according to language-specific rules
    • Applied code formatting improvements (line length, consistency)
  • Updated: tests/benchmarks/test_sentence_decomposition_scoring.py

    • Added test_0062_lt_inessive_scores_same_as_locative() to verify Lithuanian case equivalence scoring
    • Applied code formatting improvements for consistency
  • New test file: tests/langtools/test_form_equivalences.py

    • Comprehensive test coverage for normalization and equivalence checking
    • Tests alias resolution, component reordering, case-insensitivity, and language-specific behavior

Implementation Details

  • Form normalization uses regex parsing to extract role, language code, and components
  • Components are categorized (case/other, number, gender) and reordered to ensure consistent canonical form
  • Language-specific aliases are pluggable via the _LANG_CASE_ALIASES dictionary
  • All form string comparisons are case-insensitive
  • Forms without language prefixes (e.g., "preposition/base") are returned unchanged
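The normalization steps listed above can be sketched roughly as follows. This is a minimal illustration and not the PR's actual code: the regex, the `_NUMBERS` and `_GENDERS` vocabularies, and the inline contents of `_LANG_CASE_ALIASES` are assumptions made for the example.

```python
import re
from typing import Dict, List

# Assumed inline alias table; the PR loads the Lithuanian entry from its own module.
_LANG_CASE_ALIASES: Dict[str, Dict[str, str]] = {"lt": {"inessive": "locative"}}

# Assumed vocabularies used to recognise the number and gender slots.
_NUMBERS = {"singular", "plural", "dual"}
_GENDERS = {"masculine", "feminine", "neuter"}

# Assumed pattern for role/lang_component_component,
# e.g. "noun/lt_locative_singular".
_FORM_RE = re.compile(r"^(?P<role>[a-z]+)/(?P<lang>[a-z]{2,3})_(?P<rest>[a-z_]+)$")


def normalize_grammatical_form(form: str) -> str:
    """Resolve case aliases and reorder components to case, number, gender."""
    match = _FORM_RE.match(form.strip().lower())
    if match is None:
        # No language prefix (e.g. "preposition/base"): return unchanged.
        return form

    lang = match.group("lang")
    aliases = _LANG_CASE_ALIASES.get(lang, {})
    components = [aliases.get(c, c) for c in match.group("rest").split("_") if c]

    # Bucket each component, preserving relative order inside a bucket.
    case_or_other: List[str] = []
    numbers: List[str] = []
    genders: List[str] = []
    for component in components:
        if component in _NUMBERS:
            numbers.append(component)
        elif component in _GENDERS:
            genders.append(component)
        else:
            case_or_other.append(component)

    ordered = case_or_other + numbers + genders
    return f"{match.group('role')}/{lang}_{'_'.join(ordered)}"


def are_grammatical_forms_equivalent(form_a: str, form_b: str) -> bool:
    """Compare two form strings after normalization, ignoring letter case."""
    return (
        normalize_grammatical_form(form_a).lower()
        == normalize_grammatical_form(form_b).lower()
    )
```

Under these assumptions, `noun/lt_singular_inessive` and `noun/lt_locative_singular` normalize to the same string, while the same pair with an `fi` prefix stays distinct because no aliases are registered for that language.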

https://claude.ai/code/session_01XLApSPLmn4brnG74uriPkQ

Introduces langtools.form_equivalences with normalize_grammatical_form()
and are_grammatical_forms_equivalent() to handle cases where different
term conventions describe the same grammatical form.

The first equivalence defined is Lithuanian locative = inessive: the
traditional Lithuanian grammar term "locative" (vietininkas) is the same
case that cross-linguistic typologists and LLMs familiar with Finnish/
Estonian/Hungarian often call "inessive".

The 0062 sentence-decomposition benchmark scorer now calls
are_grammatical_forms_equivalent() inside _grammatical_form_similarity(),
so noun/lt_locative_singular and noun/lt_inessive_singular (or any
reordering such as noun/lt_singular_inessive) all receive full credit.
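
A rough sketch of how the scorer might consult the equivalence check before falling back to partial credit. The fallback shown here (token overlap) is purely an assumption for illustration; the PR text does not describe what `_grammatical_form_similarity()` does when forms are not equivalent, and the `is_equivalent` parameter is a hypothetical stand-in for `are_grammatical_forms_equivalent()`.

```python
import re
from typing import Callable, Set


def _tokens(form: str) -> Set[str]:
    """Split a form string on '/' and '_' into lowercase tokens."""
    return {t for t in re.split(r"[/_]", form.lower()) if t}


def grammatical_form_similarity(
    expected: str,
    predicted: str,
    is_equivalent: Callable[[str, str], bool],
) -> float:
    """Return 1.0 for equivalent forms, else a hypothetical overlap score."""
    # Full credit when language-specific rules say the forms match.
    if is_equivalent(expected, predicted):
        return 1.0

    # Assumed fallback: Jaccard overlap of the form tokens.
    expected_tokens = _tokens(expected)
    predicted_tokens = _tokens(predicted)
    if not expected_tokens or not predicted_tokens:
        return 0.0
    shared = expected_tokens & predicted_tokens
    return len(shared) / len(expected_tokens | predicted_tokens)
```

Passing the predicate in as an argument keeps the sketch independent of the `langtools` package; the real method presumably calls the function directly.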

https://claude.ai/code/session_01XLApSPLmn4brnG74uriPkQ

@powera (Owner, Author) commented Feb 25, 2026

@codex for review - is this needed? We could have per-language context to avoid needing to convert output.

@chatgpt-codex-connector (bot) left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 07525d9898

import re
from typing import Dict, List, Optional, Tuple

from langtools.lt.case_equivalences import LT_CASE_ALIASES

[P2] Avoid importing LT package for static alias map

Importing LT_CASE_ALIASES via langtools.lt.case_equivalences forces Python to execute langtools/lt/__init__.py first, which eagerly imports Wiktionary-related modules (langtools.lt.types and then clients.wiktionary.client) and pulls in extra runtime dependencies unrelated to form comparison. As a result, using langtools.form_equivalences (and now sentence decomposition scoring) can fail at import time in environments that only need benchmark scoring but not the Wiktionary stack; this alias table should live in a dependency-light module outside the langtools.lt package init path.

form_equivalences.py now reads langtools/<lang>/case_equivalences.py
on first use via importlib.util.spec_from_file_location, bypassing
the language package's __init__.py entirely.  This means benchmark
scoring environments that only need form comparison never pay the cost
of the Wiktionary/SQLAlchemy stack that langtools.lt.__init__ pulls in.

langtools/lt/case_equivalences.py is re-introduced as a plain,
import-free data file exposing CASE_ALIASES = {"inessive": "locative"}.
Any language can add its own case_equivalences.py without editing
form_equivalences.py.
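
The loading approach described in this commit message could look roughly like the sketch below. It is an illustration under assumptions, not the committed code: the function name `load_case_aliases`, the explicit `langtools_root` argument, the synthetic module name, and the cache dictionary are all hypothetical. Only `importlib.util.spec_from_file_location`, the `langtools/<lang>/case_equivalences.py` layout, and the `CASE_ALIASES` name come from the text above.

```python
import importlib.util
from pathlib import Path
from typing import Dict, Tuple

# Hypothetical cache so each data file is executed at most once per root.
_ALIAS_CACHE: Dict[Tuple[Path, str], Dict[str, str]] = {}


def load_case_aliases(langtools_root: Path, lang: str) -> Dict[str, str]:
    """Load CASE_ALIASES from <root>/<lang>/case_equivalences.py by file path.

    Loading by path means the language package's __init__.py never runs.
    """
    key = (langtools_root, lang)
    if key in _ALIAS_CACHE:
        return dict(_ALIAS_CACHE[key])

    aliases: Dict[str, str] = {}
    path = langtools_root / lang / "case_equivalences.py"
    # Only accept plain alphabetic codes so `lang` cannot escape the root.
    if lang.isalpha() and path.is_file():
        spec = importlib.util.spec_from_file_location(
            f"_case_equivalences_{lang}", path
        )
        if spec is not None and spec.loader is not None:
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            aliases = dict(getattr(module, "CASE_ALIASES", {}))

    _ALIAS_CACHE[key] = aliases
    return dict(aliases)
```

Languages without a `case_equivalences.py` file simply yield an empty mapping in this sketch, which matches the stated goal that adding a language should not require editing `form_equivalences.py`.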

https://claude.ai/code/session_01XLApSPLmn4brnG74uriPkQ