Conversation
Collaborator
|
Stefano,

Decomposition is NOT always safe or appropriate when converting from script to Roman. For example, Cyrillic has many characters with diacritical marks that should not be encoded as separate characters, in particular Йй and Ёё. In Latin transliteration, these diacritical marks are encoded separately.

Randy Barry - Рэнди Барри - ᠷᠧᠨᠳᠢ

On Mar 16, 2025, at 10:17, Stefano Cossu wrote:

This PR adds a normalization step to both S2R and R2S. It converts all pre-composed characters in the source into their decomposed form (combining diacritic + base symbol). With this step, conversion tables only need to address tokens in the decomposed form.
If this is the intended behavior for Scriptshifter, please approve.
You can view, comment on, or merge this pull request online at:
#189
Commit Summary
aefb899 Normalize precomposed Unicode characters.
47df258 Fix typo.
File Changes (1 file)
M scriptshifter/trans.py (10)
Patch Links:
https://github.com/lcnetdev/scriptshifter/pull/189.patch
https://github.com/lcnetdev/scriptshifter/pull/189.diff
|
Collaborator
Author
|
Would it be safe to apply to R2S only?
Collaborator
|
Yes, I think normalizing precomposed to decomposed for Latin script should be safe. Since the old MARC environment is very specific, it may be best to control exactly which precomposed Latin characters get decomposed.

Randy Barry - Рэнди Барри - ᠷᠧᠨᠳᠢ

On Mar 16, 2025, at 19:39, Stefano Cossu wrote:

Would it be safe to apply to R2S only?
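One way to read the suggestion above (decompose Latin only, and control exactly which characters are affected) is sketched below. This is an illustration, not the code in `scriptshifter/trans.py`: the function name `decompose_latin`, the `allowlist` parameter, and the use of the Unicode character name to detect Latin letters are all assumptions made for the example.

```python
import unicodedata


def decompose_latin(text, allowlist=None):
    """Decompose precomposed Latin characters only (hypothetical helper).

    If `allowlist` is given, only characters in it are decomposed;
    otherwise every character whose Unicode name starts with "LATIN" is.
    Characters from other scripts (e.g. Cyrillic Й, Ё) pass through unchanged.
    """
    out = []
    for ch in text:
        if allowlist is not None:
            eligible = ch in allowlist
        else:
            # Assumption: the Unicode name is a good enough script test here.
            eligible = unicodedata.name(ch, "").startswith("LATIN")
        out.append(unicodedata.normalize("NFD", ch) if eligible else ch)
    return "".join(out)


# U+00E9 (é) is split into e + U+0301; U+0419 (Й) is left precomposed.
result = decompose_latin("\u00e9 \u0419")
```

Because each character is normalized on its own, this sketch does not reorder combining marks that are already present as separate code points in the input.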
|
thisismattmiller
approved these changes
Mar 26, 2025
This PR adds a normalization step to both S2R (script-to-Roman) and R2S (Roman-to-script). It converts all precomposed characters in the source into their decomposed form (base symbol followed by combining diacritic). With this step, conversion tables only need to address tokens in the decomposed form.
If this is the intended behavior for Scriptshifter, please approve.
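For reviewers who want to see what the normalization step does to individual characters, here is a minimal sketch. It assumes the step is equivalent to Unicode canonical decomposition (NFD) through Python's standard `unicodedata` module; the helper name `decompose` is made up for the example and is not taken from `scriptshifter/trans.py`.

```python
import unicodedata


def decompose(text):
    """Return `text` in Normalization Form D (canonical decomposition)."""
    return unicodedata.normalize("NFD", text)


def code_points(text):
    """List the code points of `text` as U+XXXX strings, for inspection."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Latin: é (U+00E9) becomes e followed by a combining acute accent.
print(code_points(decompose("\u00e9")))  # ['U+0065', 'U+0301']

# Cyrillic: Й (U+0419) becomes И followed by a combining breve.
print(code_points(decompose("\u0419")))  # ['U+0418', 'U+0306']
```

The second case is the one raised in the review: under full decomposition, Й and Ё no longer reach the conversion tables as single characters.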