Commit 9905f89

Added some notes on the degradation of Tibetan
1 parent 2535866 commit 9905f89

1 file changed

Lines changed: 26 additions & 0 deletions

td-02014-04-comments.txt

@@ -248,3 +248,29 @@ Arabic

Several comments and ideas on the degradation of Arabic are contained in
separate files: https://github.com/pandey/degrade


--------------------------------------------------------------------------------

Tibetan

The degradation of Tibetan is quite lossy. The encoding model for Tibetan has
a full set of subjoined characters for representing all consonants when they
occur in clusters. The virama model does not apply. These subjoined characters,
such as ྲ U+0FB2 TIBETAN SUBJOINED LETTER RA, are defined in the UCD as
having the general category "Mn" (Nonspacing Mark). As the td algorithm
strips away all combining characters, it also removes all subjoined letters.
Compare the following original text on the left with its degraded form on
the right:

གྲྱུ -> ག
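
The loss shown above can be reproduced with a minimal sketch. It assumes
that "strips away all combining characters" means dropping every character
whose general category is a Mark (Mn, Mc, Me). The function name
strip_combining is hypothetical and this is not the td implementation itself.

```python
import unicodedata


def strip_combining(text: str) -> str:
    """Drop every character whose general category is a Mark (Mn, Mc, Me).

    Assumption: this approximates the stripping step described in the notes;
    it is not the actual td code.
    """
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )


# GA + SUBJOINED RA + SUBJOINED YA + VOWEL SIGN U
original = "\u0F42\u0FB2\u0FB1\u0F74"
degraded = strip_combining(original)

# Only the base letter GA (U+0F42) survives.
print([f"U+{ord(ch):04X}" for ch in degraded])
```

Under this assumption the two subjoined consonants and the vowel sign are
all removed, leaving only the base letter.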

The encoded sequence for གྲྱུ is: < ག U+0F42 TIBETAN LETTER GA, ྲ U+0FB2 TIBETAN
SUBJOINED LETTER RA, ྱ U+0FB1 TIBETAN SUBJOINED LETTER YA, ུ U+0F74 TIBETAN
VOWEL SIGN U>.
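
The sequence and the general category of each character can be checked
against the Unicode data that ships with Python. This is only an
illustrative sketch; the helper name describe is hypothetical.

```python
import unicodedata


def describe(text: str) -> list[tuple[str, str, str]]:
    """Return (code point, character name, general category) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


sequence = "\u0F42\u0FB2\u0FB1\u0F74"
for code_point, name, category in describe(sequence):
    print(code_point, name, category)
```

Only the first character is a letter (Lo); the remaining three are
nonspacing marks (Mn), which is why a mark-stripping pass removes them.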

The degradation for Tibetan should retain all subjoined letters. Alternatively,
if more elemental text is required, each subjoined letter should be
converted to its corresponding regular form.

Secondly, vowel signs should probably be retained.
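
The two suggestions above could be sketched as follows. This is a
hypothetical Tibetan-aware variant, not part of td: the function names
(to_regular, degrade_tibetan) and both keyword options are assumptions
introduced for illustration. Subjoined letters are mapped to regular
letters by character name, so a subjoined letter with no same-named
regular letter is left unchanged.

```python
import unicodedata

SUBJOINED_PREFIX = "TIBETAN SUBJOINED LETTER "
VOWEL_PREFIX = "TIBETAN VOWEL SIGN "


def to_regular(ch: str) -> str:
    """Map a subjoined letter to the regular letter with the same name."""
    name = unicodedata.name(ch, "")
    if not name.startswith(SUBJOINED_PREFIX):
        return ch
    try:
        return unicodedata.lookup("TIBETAN LETTER " + name[len(SUBJOINED_PREFIX):])
    except KeyError:
        # No regular letter with a matching name; keep the subjoined form.
        return ch


def degrade_tibetan(text: str, convert_subjoined: bool = False,
                    keep_vowel_signs: bool = True) -> str:
    """Strip marks, but treat subjoined letters and vowel signs specially."""
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith(SUBJOINED_PREFIX):
            # Retain the subjoined letter, or convert it to its regular form.
            out.append(to_regular(ch) if convert_subjoined else ch)
        elif name.startswith(VOWEL_PREFIX):
            if keep_vowel_signs:
                out.append(ch)
        elif unicodedata.category(ch).startswith("M"):
            continue  # every other combining mark is still stripped
        else:
            out.append(ch)
    return "".join(out)


text = "\u0F42\u0FB2\u0FB1\u0F74"
retained = degrade_tibetan(text)
elemental = degrade_tibetan(text, convert_subjoined=True)
print([f"U+{ord(ch):04X}" for ch in retained])
print([f"U+{ord(ch):04X}" for ch in elemental])
```

With the defaults the example sequence passes through unchanged. With
convert_subjoined=True the cluster becomes GA, RA, YA as regular letters
followed by the vowel sign, which no longer renders as a single stack.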
