Clarification on Asian Cyrillic mappings #3

Open
andjc opened this issue Dec 14, 2022 · 12 comments
@andjc

andjc commented Dec 14, 2022

The various ALA-LC romanisation tables for Cyrillic script languages have complex encoding issues, and it seems that the mappings in the file below diverge from the intent of the romanisation tables.

In the Asian Cyrillic mapping at

# CONVERION OF "I/i" LIGATED TO "E/e", SOME WITH MACRON (0304) AND OGONEK (0328)
"\u0464": "I\uFE20E\uFE21\u0304"
"\u0468": "I\uFE20E\uFE21\u0328"

Are these the correct mappings?

Let's take a look at “Ѩ” (U+0468), which the file maps to I\uFE20E\uFE21\u0328. This is the unnormalised form: \uFE21 and \u0328 have different canonical combining classes, so normalising to NFD canonically reorders the diacritics, yielding I\uFE20E\u0328\uFE21.

“Ѥ” (U+0464) is mapped in the file to I\uFE20E\uFE21\u0304. \uFE21 and \u0304 belong to the same combining class, so the diacritics interact typographically and the two orders are not canonically equivalent. Order matters, and the two sequences constitute different graphemes or perceivable characters.
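The two cases above can be checked with Python's standard `unicodedata` module. This is only a minimal sketch; the variable names and the `describe` helper are illustrative and not part of the mapping file or of ScriptShifter.

```python
import unicodedata


def describe(seq):
    """Return (code point, canonical combining class) pairs for a string."""
    return [(f"U+{ord(ch):04X}", unicodedata.combining(ch)) for ch in seq]


ogonek_mapping = "I\uFE20E\uFE21\u0328"  # mapping for U+0468 in the file
macron_mapping = "I\uFE20E\uFE21\u0304"  # mapping for U+0464 in the file
macron_swapped = "I\uFE20E\u0304\uFE21"  # same marks, opposite order

# U+FE21 and U+0328 have different combining classes, so canonical
# ordering moves the ogonek in front of the right half mark.
ogonek_nfd = unicodedata.normalize("NFD", ogonek_mapping)
print(describe(ogonek_nfd))

# U+FE21 and U+0304 share a combining class, so NFD leaves each order
# untouched and the two strings stay distinct.
macron_nfd = unicodedata.normalize("NFD", macron_mapping)
swapped_nfd = unicodedata.normalize("NFD", macron_swapped)
print(macron_nfd == macron_mapping, macron_nfd == swapped_nfd)
```

The first `print` shows the reordered ogonek sequence; the second shows that normalisation cannot reconcile the two macron orders, so the mapping file has to pick one deliberately.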

So visually (with a properly designed font for these characters)

I\uFE20E\uFE21\u0304 would render as the letter I with a left half ligature tie directly above the I, followed by an E with a right half ligature tie directly above the E, and a macron above the right ligature tie centred on the right ligature tie and the E.

While I\uFE20E\u0304\uFE21 would render as the letter I with a left half ligature tie directly above the I, followed by an E macron with a right half ligature tie directly above the E macron.

The sequences are not normalised, so I am wondering what the intended sequence is?

This gets much more complicated with

"\u04B4": "T\uFE20S\uFE21\u0307"

This sequence should render as the letter T with the left half ligature tie positioned directly above the T, followed by S with a right ligature tie positioned above it, and a dot-above positioned above the right half ligature tie, i.e. centred above the S.

It is important to note that U+0307 belongs to S + U+FE21, and not to T + U+FE20 + S + U+FE21. I assume what you are trying to map here is the sequence equivalent to T + ◌͡ + CGJ +◌̇ + S, i.e. U+0054 U+0361 U+034F U+0307 U+0053

The current mapping using ligature ties is not equivalent to the double diacritic sequence U+0054 U+0361 U+0053 U+0307 or U+0054 U+0361 U+034F U+0307 U+0053.
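As a rough illustration of why the CGJ matters in the double diacritic sequence, the sketch below normalises the sequence with and without U+034F using Python's `unicodedata`. The variable names are illustrative only.

```python
import unicodedata

CGJ = "\u034F"  # COMBINING GRAPHEME JOINER, combining class 0

with_cgj = "T\u0361" + CGJ + "\u0307S"  # T + tie + CGJ + dot above + S
without_cgj = "T\u0361\u0307S"          # T + tie + dot above + S

nfd_with = unicodedata.normalize("NFD", with_cgj)
nfd_without = unicodedata.normalize("NFD", without_cgj)

# The CGJ blocks canonical reordering, so the dot stays after the tie.
print(nfd_with == with_cgj)

# Without the CGJ, the dot above (lower combining class) is moved in
# front of the double tie, i.e. it ends up directly on the T.
print([f"U+{ord(ch):04X}" for ch in nfd_without])
```

Without the CGJ, normalisation silently turns the sequence into a dotted T followed by the tie, which is not the intended romanisation.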

If we map half ligature tie forms to double spanning diacritic equivalents, we get the following:

| Half forms | Double forms |
| --- | --- |
| T + U+FE20 + S + U+0307 + U+FE21 | T + U+0361 + S + U+0307 |
| T + U+FE20 + S + U+FE21 + U+0307 | Undefined |
| Undefined | T + U+0361 + U+034F + U+0307 + S |

compare with:

| Half forms | Double forms |
| --- | --- |
| I + U+FE20 + E + U+FE21 + U+0328 | I + U+0361 + E + U+0328 |
| I + U+FE20 + E + U+0328 + U+FE21 | I + U+0361 + E + U+0328 |

Clarifications on intent and usage would be useful.

I do understand that bibliographic data can be dirty, and this is reflected in the Latin to Cyrillic mappings, but I am curious about the Cyrillic to Latin mappings mentioned above.

@scossu
Collaborator

scossu commented Mar 16, 2023

I think I understand the issue at hand and it seems like a configuration setup issue, which can be resolved by changing the script sequences. @thisismattmiller @kefo Let me know if some changes to the logic are otherwise needed.

@andjc
Author

andjc commented Mar 20, 2023

@scossu trying to reword the issue for clarity. The current mapping involves conceptual errors in the Unicode sequences that represent some of the romanisations in the mapping file, leading to incorrect character sequences. Some of the romanisations can only be represented using the double spanning diacritic, not the paired half marks.

@RandyBarry
Collaborator

The Asian Cyrillic mapping was an attempt to provide for conversion between ALL possible Unicode Cyrillic characters and unique Latin transliterations. I used the existing ALA-LC Romanization Tables as a starting point BUT those tables are inconsistent between scripts. A unique Unicode Cyrillic character is sometimes mapped to different Latin transliterations, depending upon the language involved. This is a weakness in the ALA-LC schemes.

I created and use the Asian Cyrillic mapping mostly for Russian publications involving minority languages. It works fairly well. My long term hope is to suggest changes to a small number of ALA-LC Romanization Tables to make mappings between Cyrillic and Latin consistent. Some of the problematic characters involved are the Cyrillic A, O, and У with dieresis, which are sometimes mapped to the Latin A, O, and U with a breve or dot above. It would seem more logical and intuitive to map them to the Latin letters + the combining dieresis. That would leave the Latin A, O, and U with some other mark to be used for less intuitive mappings, like the Cyrillic Ә, Ө and Ү.

Obviously, the Asian Cyrillic ScriptShifter option is not ideal since it does not always convert Latin or Cyrillic to the expected equivalents prescribed by some language-specific tables. I had intended the Asian Cyrillic option to be "use at your own risk". I would like to have an accompanying document that shows what the mapping of Cyrillic to Latin actually is for this superset. That mapping will drive any recommendations I ever make to adjust existing ALA-LC tables to a more consistent transliteration.

One of the biggest problems is actually the regular Cyrillic letter for the Latin "G/g", which in some Slavic languages is mapped to Latin "H/h". I doubt the library community will ever agree to changing the ALA-LC tables for Belarusian and Ukrainian to map "Г/г" to "G/g" and the special Cyrillic G/g to some modified Latin G/g. That would be the ideal solution for that one letter.
People who work with various Slavic and non-Slavic languages are probably aware of the traps in the ALA-LC tables. The Bulgarian "u+breve" is another problem/inconsistent mapping between Cyrillic and Latin scripts.

@andjc
Author

andjc commented Dec 5, 2023

@RandyBarry a couple of thoughts. The ALA-LC Romanization tables aren't technically transliteration tables. They have more in common with language-specific transcription schemes than with transliteration. Cyrillic is spread across the Azerbaijani, Belarusian, Bulgarian, Church Slavic, Kurdish, Macedonian, Non-Slavic Languages (in Cyrillic Script), Romanian (in Cyrillic), Russian, Rusyn / Carpatho-Rusyn, Ukrainian, and Uzbek tables, which makes for many competing interpretations of a character and its mapping.

The Non-Slavic Languages (in Cyrillic Script) table is actually multiple tables, a core table shared across the languages covered by the document and then a series of language specific tables containing amendments to the core table for that specific language.

It sounds like you need a script level transliteration scheme rather than the existing ALA-LC romanisation tables. A transliteration system would be a very different beast to the ALA-LC romanisation tables and much easier to process.

Unlike a transliteration system, a transcription system tends to reflect the language and does not need to be consistent across languages that use the script.

@RandyBarry
Collaborator

RandyBarry commented Dec 6, 2023 via email

@scossu
Collaborator

scossu commented Dec 6, 2023

@andjc You might have noticed that a whole new set of non-Slavic Cyrillic tables has been released thanks to @RandyBarry ): https://github.com/lcnetdev/scriptshifter/tree/main/scriptshifter/tables/data This might not have been released on the LC Scriptshifter service yet, but it should be available soon.

@andjc
Author

andjc commented Dec 6, 2023

> @andjc You might have noticed that a whole new set of non-Slavic Cyrillic tables has been released thanks to @RandyBarry ):

Thanks. Will look at them.

@andjc
Author

andjc commented Dec 6, 2023

@scossu , @RandyBarry one core problem with the Cyrillic romanisations and their mappings here is the use of U+FE20 and U+FE21.

In the MARC-8 to Unicode mapping tables, xEB and xEC are mapped to U+0361, rather than U+FE20 and U+FE21. The use of U+0361 is recommended as the appropriate mapping. There are two reasons for this:

  1. Unicode itself recommends U+0361 over U+FE20 and U+FE21, and
  2. Not all romanisation sequences can be encoded using U+FE20 and U+FE21. This is especially relevant to ALA-LC Cyrillic romanisations.

Some time ago I threw together a rough draft of a document discussing the encoding model of Cyrillic romanisations.

If we take a look at the abkhaz_cyrillic.yml file, the encoding and romanisation fun begins with Ҵ and ҵ, which romanise to T͡͏̇S, T͡͏̇s, and t͡͏̇s.

The valid Unicode representations for the romanisation of Ҵ and ҵ are <U+0054, U+0361, U+034F, U+0307, U+0053>, <U+0054, U+0361, U+034F, U+0307, U+0073>, and <U+0074, U+0361, U+034F, U+0307, U+0073>, i.e. <T + ◌͡ + CGJ + ◌̇ + S>, <T + ◌͡ + CGJ + ◌̇ + s>, and <t + ◌͡ + CGJ + ◌̇ + s>.

"\u04B4": "T\uFE20S\uFE21\u0307"
"\u04B5": "t\uFE20s\uFE21\u0307"

In the sequence "T\uFE20S\uFE21\u0307" the dot-above is anchored to the right half ligature, which in turn is anchored to the S. So typographically, and according to the Unicode character model, the dot-above is positioned centrally above the S and the right half ligature. It is NOT applied to the whole TS ligature. The romanisation for Ҵ and ҵ cannot be formed or specified in Unicode using U+FE20 and U+FE21. The only way the romanisation can be represented is by using <U+0361, U+034F, U+0307> instead, which anchors U+0307 to U+034F.

Likewise with "t\uFE20s\uFE21\u0307".

These are character sequences that do not, and cannot, represent the romanisations of Ҵ and ҵ.

In

"T\uFE20S\uFE21\u0307": "\u04B4"
"T\uFE20s\uFE21\u0307": "\u04B4"
"t\uFE20s\uFE21\u0307": "\u04B5"
"T\uFE20\u0307S\uFE21": "\u04B4"
"T\uFE20\u0307s\uFE21": "\u04B4"
"t\uFE20\u0307s\uFE21": "\u04B5"
"T\u0307\uFE20S\uFE21": "\u04B4"
"T\u0307\uFE20s\uFE21": "\u04B4"
"t\u0307\uFE20s\uFE21": "\u04B5"

I would consider these lines repair work: they map malformed romanisation sequences to the correct Cyrillic characters, i.e. data repair.
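The same idea can be applied to Latin data directly. The sketch below rewrites the nine malformed half-mark sequences listed above into the U+0361 + CGJ + U+0307 form. It is only an illustration: `build_repair_table` and `repair` are hypothetical names, not ScriptShifter functions, and it assumes the input has not been NFC-normalised (which would compose T + U+0307 into a single code point).

```python
DOT = "\u0307"                     # combining dot above
LEFT, RIGHT = "\uFE20", "\uFE21"   # combining ligature left/right half
TIE, CGJ = "\u0361", "\u034F"      # double inverted breve, grapheme joiner


def build_repair_table():
    """Map each malformed half-mark sequence to the U+0361 + CGJ form."""
    table = {}
    for first, second in (("T", "S"), ("T", "s"), ("t", "s")):
        target = first + TIE + CGJ + DOT + second
        variants = (
            first + LEFT + second + RIGHT + DOT,  # dot after right half
            first + LEFT + DOT + second + RIGHT,  # dot after left half
            first + DOT + LEFT + second + RIGHT,  # dot directly on first letter
        )
        for variant in variants:
            table[variant] = target
    return table


REPAIRS = build_repair_table()


def repair(text):
    """Replace every known malformed sequence in text with its repaired form."""
    for bad, good in REPAIRS.items():
        text = text.replace(bad, good)
    return text


print([f"U+{ord(ch):04X}" for ch in repair("T\uFE20S\uFE21\u0307")])
```

A plain lookup table keeps the repair explicit and auditable; a more general pattern-based rewrite would need care not to touch legitimate undotted ligatures.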

@andjc
Author

andjc commented Dec 6, 2023

@andj: Agreed. The ALA-LC Romanization Tables is a collection of inconsistent transcription schemes. My opinion is that too much emphasis was placed on pronunciation close to the source language rather than an easily reversible script-consistent transliteration from a source script to Latin script. I assume this is widely recognized by script specialists. Nonetheless, ScriptShifter is being developed to allow catalogers who apply the ALA-LC Romanization Tables to convert between as many scripts as possible. Its design should allow for conversion options beyond the limited and admittedly flawed ALA-LC schemes. It remains to be seen what the final set of conversion options will grow to be. By the way, ScriptShifter is to replace an even more limited tool whose shortcomings have worried me for years. I believe we’ve made significant progress with ScriptShifter. I’m happy to see a wider user base that will challenge and help improve the tool.

There is a range of issues with romanisation. My current favourite is the divergence of romanisation practice in Lao and Thai after the MARC-8 information on the code points used was removed from the romanisation tables. This inadvertently changed the romanisations used and resulted in divergent practices between cataloguers and institutions; with copy cataloguing, a library could end up with romanisations using two different characters, depending on which table and interpretation was used in a record. See a note I put together.

One of the side projects I am working on is a Python tool to repair internationalisation issues in bibliographic data. Data from a Voyager installation is particularly fun, since Voyager uses CESU-8 instead of UTF-8 and doesn't handle encoding conversion when exporting records, and different methods of retrieving or exporting the records have different effects on the data. This mainly affects characters in the SMP, SIP and TIP.

Then there are the changes to how Alif and Ayn were mapped to Unicode, and older data that may not have been converted. There are lots of other issues as well. The scriptshifter mapping files are useful in identifying character sequences that should be repaired.
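For readers unfamiliar with the CESU-8 problem, here is a minimal sketch of one way to decode such bytes in Python. It is not taken from the tool mentioned above; `cesu8_to_str` is a hypothetical helper, and it assumes the input is well-formed CESU-8 (every surrogate is part of a valid pair).

```python
def cesu8_to_str(data: bytes) -> str:
    """Decode CESU-8 bytes, joining encoded surrogate pairs into one character."""
    # Step 1: decode as UTF-8 but let the 3-byte encoded surrogates through
    # as lone surrogate code points instead of raising an error.
    with_surrogates = data.decode("utf-8", errors="surrogatepass")
    # Step 2: round-trip through UTF-16, where a high/low surrogate pair is
    # the normal encoding of a supplementary-plane character.
    utf16 = with_surrogates.encode("utf-16-le", errors="surrogatepass")
    return utf16.decode("utf-16-le")


# U+20000 (a SIP character) is stored in CESU-8 as two 3-byte surrogates.
cesu_bytes = b"\xed\xa1\x80\xed\xb0\x80"
decoded = cesu8_to_str(cesu_bytes)
print(f"U+{ord(decoded):05X}", decoded.encode("utf-8"))
```

BMP-only text passes through unchanged, since CESU-8 and UTF-8 agree there; an unpaired surrogate makes the final decode raise `UnicodeDecodeError`, which is a useful signal that the record needs manual attention.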

@andjc
Author

andjc commented Dec 6, 2023

The appropriate link for the note on the MARC-8 to Unicode mapping discussed above is note 1 on the ANSEL/Extended Latin mapping table:

Note 1: The Ligature that spans two characters is constructed of two halves in MARC-8: EB (Ligature, first half) and EC (Ligature, second half). The preferred Unicode/UTF-8 mapping is to the single character Ligature that spans two characters, U+0361. The single character Ligature is encoded between the two characters to be spanned. The two half Ligatures in Unicode, to which the Ligature has been mapped since 1996, are indicated in the mapping as alternatives, but their use is not recommended. It is expected that font support for the single character Ligature mark will be more easily obtained than for the two halves.

But then again, libraries are very bad at ensuring appropriate fonts are used for rendering bibliographic data. I suspect that is one of the underlying reasons for the confusion in the Lao/Thai romanisation, i.e. the tables used the wrong glyph because no font was available at the time to display the glyph they specified via the MARC-8 mapping.

@scossu
Collaborator

scossu commented Dec 7, 2023

@andjc Thanks for the feedback. Is there an open action item that this issue is requesting, or can I move this to the discussion section?

@andjc
Author

andjc commented Dec 7, 2023

@scossu ,

  1. For Latin -> Cyrillic, the rules should include both U+FE20 / U+FE21 and U+0361 sequences. This direction should cover both possibilities.

  2. For Cyrillic -> Latin, my gut reaction is to bring it in line with the existing MARC-8 to Unicode mapping and replace U+FE20 / U+FE21 with U+0361 sequences.

But 2) will be a change to how some, maybe many, institutions currently handle things. The Cyrillic -> Latin mappings listed in the comments above are not valid Unicode sequences for the transliteration required, so they should change. But I suspect some users may want to remain with broken data. Step 1) is practical and covers both bases. Step 2) is correct but may be controversial considering how conservative things can be.
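A rough sketch of how the two steps could fit together: keep U+0361 as the single Cyrillic -> Latin output, and derive the Latin -> Cyrillic keys in both encodings from it. The function names and the two example entries (Ц/ц as T͡S/t͡s) are illustrative assumptions, not taken from the ScriptShifter tables, and the helper only handles a tie between two plain letters with no other combining marks.

```python
import re

TIE = "\u0361"                    # combining double inverted breve
LEFT, RIGHT = "\uFE20", "\uFE21"  # combining ligature left/right half


def half_mark_variant(romanised):
    """Rewrite letter + U+0361 + letter as letter + U+FE20 + letter + U+FE21."""
    return re.sub(
        f"([A-Za-z]){TIE}([A-Za-z])",
        lambda m: m.group(1) + LEFT + m.group(2) + RIGHT,
        romanised,
    )


def build_lat_to_cyr(cyr_to_lat):
    """Accept both the U+0361 form and the half-mark form as Latin input."""
    lat_to_cyr = {}
    for cyr, lat in cyr_to_lat.items():
        lat_to_cyr[lat] = cyr                     # recommended encoding
        lat_to_cyr[half_mark_variant(lat)] = cyr  # legacy encoding
    return lat_to_cyr


# Step 2: Cyrillic -> Latin emits only U+0361 sequences (illustrative entries).
CYR_TO_LAT = {"\u0426": "T" + TIE + "S", "\u0446": "t" + TIE + "s"}

# Step 1: Latin -> Cyrillic covers both possibilities.
LAT_TO_CYR = build_lat_to_cyr(CYR_TO_LAT)
print(len(LAT_TO_CYR))
```

Deriving the legacy keys from the recommended form keeps a single source of truth, so the two directions cannot drift apart.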

Relevant parts of the Unicode standard, including the use of CGJ with double diacritics and the section on combining half marks.
