How to handle transliteration in some languages? #1

thadguidry · 2020-04-28T16:28:13Z

Pinyin (pīnyīn, "spell sound") is the standard romanization system for Mandarin Chinese.

Example: https://www.wikidata.org/wiki/Property:P1721

The transliteration is the middle term in each of the following examples:

water -> shuǐ -> 水
liquid water -> yètài shuǐ -> 液态水

水 -> shuǐ -> water
液态水 -> yètài shuǐ -> liquid water

Perhaps it's best that this is read from mappings already directly applied to Chinese Lexeme Senses as demonstrated here:

https://www.wikidata.org/wiki/Lexeme:L8219#S1

Translations are covered by Wikidata's Sense Statements as evidenced here:
https://www.wikidata.org/wiki/Lexeme:L3302

But transliterations (romanizations) do not currently seem to be well documented on Wikidata.
This probably calls for a documentation improvement on Wikidata's side: "How best to apply transliteration for Lexemes and Senses".

References:
"water" en Sense https://www.wikidata.org/wiki/Lexeme:L3302
"liquid water" en Concept https://www.wikidata.org/wiki/Q29053744
"liquid" en Concept https://www.wikidata.org/wiki/Q11435

vrandezo · 2020-05-02T03:43:07Z

Thanks for opening the issue (and yay, Issue #1!!)

And I have to admit, I am not sure I understand the issue. This is because I really don't understand how Chinese works, so my answer might be entirely beside the point.

So I will rephrase how I understand the question and then answer that question. Please don't let me get away with it if I entirely missed the point.

Serbian Wikipedia, for example, uses two scripts (so do a few others, such as Uzbek, Tatar, etc.). And the question is how would Abstract Wikipedia support both of those scripts?

In Serbian, the situation is particularly simple: the Latin transliteration can easily be generated from Cyrillic input. So it is possible to simply generate Cyrillic output and then, at the very end, run a transliteration function over the resulting string that converts it to Latin.
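A minimal sketch of such a final pass, assuming a plain per-letter lookup is enough. The names `CYR_TO_LAT` and `to_latin` are made up for illustration and are not part of any existing Abstract Wikipedia code:

```python
# Serbian Cyrillic -> Latin lookup, lowercase letters only.
# Three Cyrillic letters map to Latin digraphs: љ, њ, џ.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}


def to_latin(text: str) -> str:
    """Transliterate Serbian Cyrillic text to Latin, one letter at a time."""
    out = []
    for ch in text:
        lat = CYR_TO_LAT.get(ch.lower())
        if lat is None:
            out.append(ch)                # not a Serbian Cyrillic letter: keep as-is
        elif ch.isupper():
            out.append(lat.capitalize())  # Њ -> Nj, Џ -> Dž
        else:
            out.append(lat)
    return "".join(out)


print(to_latin("вода"))    # Serbian for "water"
print(to_latin("Њујорк"))
```

Because every Cyrillic letter has exactly one Latin counterpart, this direction needs no context and can run over the finished string.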

This does not always work: for example, the reverse wouldn't be as trivial, because Њ transliterates to nj, but the two letters n and j transliterate to н and ј respectively. In that case we would need to retain the information whether these are the two letters n and j which happen to be next to each other or whether it is the digraph nj.

This can be done in one of two ways: either create a slightly abstract output that retains this information with a special token, and then run a final pass over the result that removes these tokens and replaces them with the concrete letters; or rewrite the functions so that they take the script as a parameter, pushing this knowledge deeper into the function stack.
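The first option could be sketched as follows. Everything here is an assumption for illustration: the `|` marker, the function names, and the choice to keep the abstract output in Latin letters. Only lowercase is handled.

```python
MARK = "|"  # hypothetical token meaning "the letters on either side are separate"

DIGRAPHS = {"lj": "љ", "nj": "њ", "dž": "џ"}
SINGLE = {
    "a": "а", "b": "б", "c": "ц", "č": "ч", "ć": "ћ", "d": "д", "đ": "ђ",
    "e": "е", "f": "ф", "g": "г", "h": "х", "i": "и", "j": "ј", "k": "к",
    "l": "л", "m": "м", "n": "н", "o": "о", "p": "п", "r": "р", "s": "с",
    "š": "ш", "t": "т", "u": "у", "v": "в", "z": "з", "ž": "ж",
}


def finish_latin(abstract: str) -> str:
    """Final pass for Latin output: the marker carries no visible text."""
    return abstract.replace(MARK, "")


def finish_cyrillic(abstract: str) -> str:
    """Final pass for Cyrillic output: match digraphs first, then single letters."""
    out, i = [], 0
    while i < len(abstract):
        pair = abstract[i:i + 2]
        if pair in DIGRAPHS:
            out.append(DIGRAPHS[pair])
            i += 2
        elif abstract[i] == MARK:
            i += 1                        # marker only blocks digraph matching
        else:
            out.append(SINGLE.get(abstract[i], abstract[i]))
            i += 1
    return "".join(out)


print(finish_cyrillic("konj"))        # digraph nj
print(finish_cyrillic("in|jekcija"))  # separate n and j
print(finish_latin("in|jekcija"))
```

Without the marker, `finish_cyrillic("injekcija")` would merge n and j into њ, which is the ambiguity described above. The second option (passing the script down as a parameter) avoids the marker at the cost of touching every function in the stack.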

Either solution would be possible, and the respective language community can decide which one makes more sense for their particular language (in fact, this could go a long way toward resolving the differences between standard Croatian and Serbian).

So, I hope that this answer somehow applies to your question. If it doesn't please let me know and give me a bit more background. Thank you!

thadguidry · 2020-05-02T15:46:22Z

Yes it answers it partially.
I completely understand that functions could read information from lots of places.
The question is WHERE is the information stored (best).

So my only question is about Wikidata Lexemes themselves storing the transliteration maps, and how best to store that information so that Abstract Text functions can read it properly.

Where in the Lexeme ecosystem would the transliteration mapping be applied that functions could read from? Would it be on the ZH entities? or the EN entities? or both? or somewhere else?
Would P1721 "pinyin transliteration" be used always as a qualifier within the translation statement? Ex: https://www.wikidata.org/wiki/Lexeme:L3302

Or would P1721 "pinyin transliteration" be used as a direct statement on the ZH entity (which mimics how input systems work)? Ex: https://www.wikidata.org/wiki/Lexeme:L8219
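To make the two options concrete, here is a rough sketch of how a function might read each one from an entity's JSON. The two sample objects are hand-written and heavily trimmed, not fetched from Wikidata; they only follow the general shape of Wikibase JSON (`claims` → `mainsnak` / `qualifiers`). The use of P5972 as the translation property is an assumption, and the helper names are made up.

```python
PINYIN = "P1721"  # "pinyin transliteration", the property linked in this issue


def direct_values(entity: dict, prop: str = PINYIN) -> list:
    """Values of `prop` stated directly on a lexeme, sense, or form object."""
    return [
        claim["mainsnak"]["datavalue"]["value"]
        for claim in entity.get("claims", {}).get(prop, [])
        if claim["mainsnak"].get("snaktype") == "value"
    ]


def qualifier_values(entity: dict, prop: str = PINYIN) -> list:
    """Values of `prop` used as a qualifier on any statement of the object."""
    found = []
    for statements in entity.get("claims", {}).values():
        for claim in statements:
            for snak in claim.get("qualifiers", {}).get(prop, []):
                if snak.get("snaktype") == "value":
                    found.append(snak["datavalue"]["value"])
    return found


# Hand-written, trimmed samples (illustrative only).
pinyin_snak = {"snaktype": "value", "property": PINYIN,
               "datavalue": {"value": "shuǐ", "type": "string"}}

# Option 2: direct statement on the ZH entity.
zh_entity = {"id": "L8219", "claims": {PINYIN: [{"mainsnak": pinyin_snak}]}}

# Option 1: qualifier on a translation statement of the EN sense.
en_sense = {"id": "L3302-S1", "claims": {"P5972": [{
    "mainsnak": {"snaktype": "value", "property": "P5972"},  # target sense omitted
    "qualifiers": {PINYIN: [pinyin_snak]},
}]}}

print(direct_values(zh_entity))
print(qualifier_values(en_sense))
```

With the qualifier option, the same pinyin string would have to be repeated on every translation statement that points at the Chinese sense, in every source language.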

vrandezo · 2020-05-02T17:57:30Z

As I said, I really am not sufficiently knowledgeable about Chinese.

If I understand it correctly, and the transliteration is always the same for a given Chinese lexeme, and does not differ based on Sense or Form, then I would think that it makes more sense as a statement on the Chinese lexeme (as in your last screenshot).

If it is on the translation of the English lexeme for water, it looks like it is a denormalization - that date should not be a qualifier on that translation, as in your first screenshot, that doesn't look right to me. This would lead to a lot of duplication.

thadguidry · 2020-05-02T18:18:03Z

date?

thadguidry · 2020-05-02T19:19:54Z

Here's a transliteration map from OpenVanilla.org
Hopefully this clarifies the question for you to offer good advice... and then we can close this issue out after you respond.

shui 水  - water
shui 说  - talk
shui 谁  - who
shui 睡  - sleep
shui 税  - tax

vrandezo · 2020-05-03T03:35:35Z

"date" - mistyped, I meant datum, snak, or piece of information.

vrandezo · 2020-05-03T03:40:45Z

Regarding the example you showed:

It looks there as if every Chinese character has only a single Pinyin transliteration, but the result is not reversible, i.e. the same string in Latin script is ambiguous when converted back to Chinese characters.

That would indicate that it would make sense to have the render function for Chinese create Chinese characters, and if a transliteration into pinyin is desired, a function can run on top of that.
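As a rough sketch of that layering, using only the examples given in this thread: `render_zh` is a hypothetical stand-in for a real render function, and the lookup tables are toy data rather than anything read from Wikidata.

```python
from collections import defaultdict

# Chinese lexeme -> pinyin, taken from the examples in the first comment.
PINYIN_OF = {"水": "shuǐ", "液态水": "yètài shuǐ"}

# Character -> toneless key typed in an input system (the map quoted above).
TYPED_AS = {"水": "shui", "说": "shui", "谁": "shui", "睡": "shui", "税": "shui"}


def render_zh(concept: str) -> str:
    """Hypothetical render function: always produces Chinese characters."""
    return {"water": "水", "liquid water": "液态水"}[concept]


def to_pinyin(rendered: str) -> str:
    """Optional pass that runs on top of the rendered Chinese output."""
    return PINYIN_OF[rendered]


def invert(table: dict) -> dict:
    """Reverse a character -> key table; one key may collect many characters."""
    back = defaultdict(list)
    for hanzi, key in table.items():
        back[key].append(hanzi)
    return dict(back)


print(to_pinyin(render_zh("liquid water")))
print(invert(TYPED_AS)["shui"])  # five candidates for a single typed string
```

The forward direction is a plain lookup, while the inverted table returns a list, so going back from Latin script to characters needs context that the string alone does not carry.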

So I still think that my last comment holds: it looks like the pinyin form should be on the lexeme representing the Chinese character, not on the statement offering a translation coming from the English (or any other) noun.

Also, I think that this discussion would probably make more sense on Wikidata itself. I wouldn't want the modelling of Wikidata to be affected by a possible future implementation of a project proposal. That seems premature :)

Feel free to close this if this satisfies your question.

thadguidry · 2020-05-03T14:39:04Z

@vrandezo Thanks Denny. I agree with you: Pinyin is an input format, not an output, so it's not reversible. (I just wanted to make sure there wasn't something I was missing conceptually from the AbstractText effort regarding transliteration handling. Thanks for explaining!)

Regarding the Chinese example: talk, sleep, and tax are all pronounced the same in Chinese, so telling them apart relies on sentence context. Water and who are pronounced differently. All 5 are typed as "shui" in input systems, where the user is usually shown a popup to choose which Chinese lexeme they mean from the Latin-script Pinyin input.

Closing this issue now, since we have the use case and our agreed probable handling for it as a reference.

thadguidry closed this as completed May 3, 2020

vrandezo pushed a commit that referenced this issue May 5, 2020

Merge pull request #1 from google/master

c27794d

updating from Denny's branch
