Normalize precomposed Unicode characters. by scossu · Pull Request #189 · lcnetdev/scriptshifter

scossu · 2025-03-16T14:17:28Z

This PR adds a normalization step to both script-to-Roman (S2R) and Roman-to-script (R2S) transliteration. It converts all precomposed characters in the source text to their decomposed form (base character followed by combining diacritic). With this step, conversion tables only need to address tokens in the decomposed form.
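The diff itself is not reproduced in this conversation, so the following is only a minimal sketch of what such a normalization step could look like. It assumes canonical decomposition (NFD) through Python's standard `unicodedata` module; the function name `decompose` is hypothetical and not taken from `scriptshifter/trans.py`.

```python
import unicodedata


def decompose(text: str) -> str:
    """Return text in canonical decomposed form (NFD).

    Hypothetical helper for illustration; not the PR's actual code.
    """
    return unicodedata.normalize("NFD", text)


# U+00E9 (LATIN SMALL LETTER E WITH ACUTE) is a single precomposed code point.
precomposed = "\u00e9"
decomposed = decompose(precomposed)

# After NFD the string holds the base letter followed by the combining mark.
print([f"U+{ord(ch):04X}" for ch in decomposed])
```

With the input decomposed this way, a conversion table would only need an entry for `e` followed by U+0301, not a second entry for U+00E9.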

If this is the intended behavior for Scriptshifter, please approve.

RandyBarry · 2025-03-16T18:42:03Z

Stefano,

Decomposition is NOT always safe or appropriate when converting from script to Roman. For example, Cyrillic has many characters with diacritical marks that should not be encoded as separate characters, in particular Йй and Ёё. In Latin transliteration, these diacritical marks are encoded separately.

Randy Barry - Рэнди Барри - ᠷᠧᠨᠳᠢ
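The concern can be illustrated with a short check, assuming the normalization in question is Unicode NFD as provided by Python's `unicodedata` module. The helper `codepoints` is a hypothetical name used only for this sketch. Under NFD, each of the Cyrillic letters named above is split into a base letter plus a combining mark.

```python
import unicodedata


def codepoints(text: str) -> list[str]:
    """Format each character of text as a U+XXXX code point label."""
    return [f"U+{ord(ch):04X}" for ch in text]


# The Cyrillic letters singled out in the comment above.
for ch in "ЙйЁё":
    nfd = unicodedata.normalize("NFD", ch)
    print(ch, codepoints(ch), "->", codepoints(nfd))
```

Each letter comes out as two code points (for example, Й becomes U+0418 followed by the combining breve U+0306), which is the separate encoding the comment says should be avoided for Cyrillic source text.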

scossu · 2025-03-16T23:39:10Z

Would it be safe to apply to R2S only?

RandyBarry · 2025-03-17T02:26:04Z

Yes, I think normalizing precomposed to decomposed for Latin script should be safe. Since the old MARC environment is very specific, it may be best to control exactly which Latin precomposed characters get decomposed.

Randy Barry - Рэнди Барри - ᠷᠧᠨᠳᠢ
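One way to read the suggestion about controlling exactly which characters get decomposed is an explicit allowlist. The sketch below is an assumption about how that could be done, not the approach merged in this PR: the set `DECOMPOSE_ALLOWED`, its contents, and the function `decompose_selected` are all hypothetical.

```python
import unicodedata

# Hypothetical allowlist of precomposed Latin letters that may be decomposed.
# The real set would have to come from the MARC requirements mentioned above.
DECOMPOSE_ALLOWED = frozenset(
    "\u00e1\u00e9\u00ed\u00f3\u00fa"  # a, e, i, o, u with acute
    "\u0101\u0113\u012b\u014d\u016b"  # a, e, i, o, u with macron
)


def decompose_selected(text: str, allowed: frozenset = DECOMPOSE_ALLOWED) -> str:
    """Apply NFD only to characters in the allowlist; pass the rest through."""
    return "".join(
        unicodedata.normalize("NFD", ch) if ch in allowed else ch
        for ch in text
    )


# U+0101 (a with macron) is on the list; U+0439 (Cyrillic short i) is not.
sample = "\u0101 \u0439"
result = decompose_selected(sample)
print([f"U+{ord(ch):04X}" for ch in result])
```

Characters outside the allowlist, including precomposed Cyrillic letters and any Latin letters left off the list, are returned unchanged.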

scossu added 2 commits March 16, 2025 10:06

Normalize precomposed Unicode characters.

aefb899

Fix typo.

47df258

scossu requested review from RandyBarry and thisismattmiller March 16, 2025 14:17

Decompose input only in R2S.

bc8533c

thisismattmiller approved these changes Mar 26, 2025

scossu merged commit 5eea9a9 into main Mar 26, 2025

scossu deleted the decompose branch July 12, 2025 23:44

scossu merged 3 commits into main from decompose
