@@ -248,3 +248,29 @@ Arabic
248248
249249Several comments and ideas on the degradation of Arabic are contained in
250250separate files: https://github.com/pandey/degrade
251+
252+
253+ --------------------------------------------------------------------------------
254+
255+ Tibetan
256+
257+ The degradation of Tibetan is quite lossy. The encoding model for Tibetan has
258+ a full set of subjoined characters for representing all consonants when they
259+ occur in clusters. The virama model does not apply. These subjoined characters,
260+ such as ྲ U+0FB2 TIBETAN SUBJOINED LETTER RA, are assigned the general
261+ category "Mn" (Nonspacing_Mark) in the UCD. Because the td algorithm strips
262+ away all combining characters, it also removes all subjoined letters.
263+ Compare the original text on the left with the degraded form on
264+ the right:
265+
266+ གྲྱུ -> ག
267+
268+ The encoded sequence for གྲྱུ is: < ག U+0F42 TIBETAN LETTER GA, ྲ U+0FB2 TIBETAN
269+ SUBJOINED LETTER RA, ྱ U+0FB1 TIBETAN SUBJOINED LETTER YA, ུ U+0F74 TIBETAN
270+ VOWEL SIGN U>.
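
The loss described above can be reproduced with a minimal sketch. It assumes that the combining-character step of the td algorithm behaves like dropping every character whose general category is "Mn"; the function name `strip_nonspacing_marks` is hypothetical and is not taken from the td implementation.

```python
import unicodedata

# The sequence from the example above: GA, SUBJOINED RA, SUBJOINED YA, VOWEL SIGN U.
SAMPLE = "\u0F42\u0FB2\u0FB1\u0F74"


def strip_nonspacing_marks(text):
    """Drop every character whose general category is Mn.

    Assumption: this approximates the combining-character removal
    that the notes attribute to the td algorithm.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Show the general category and name of each code point in the sequence.
for ch in SAMPLE:
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

degraded = strip_nonspacing_marks(SAMPLE)
print([f"U+{ord(ch):04X}" for ch in degraded])
```

Only the base letter U+0F42 has a category other than "Mn", so it is the only code point that survives.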
271+
272+ The degradation for Tibetan should retain all subjoined letters.
273+ Alternatively, if a more elemental text is required, all subjoined
274+ letters should be converted to their corresponding regular forms.
275+
276+ Secondly, vowel signs should probably be retained.
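
One possible way to realize these suggestions is sketched below. It is an illustration under stated assumptions, not part of the td algorithm: the names `unsubjoin` and `degrade_tibetan` and the `keep_vowels` flag are hypothetical, and the regular form is found by rewriting the Unicode character name ("SUBJOINED LETTER" to "LETTER"). A subjoined letter with no such counterpart is left unchanged.

```python
import unicodedata

SUBJOINED_PREFIX = "TIBETAN SUBJOINED LETTER "
VOWEL_PREFIX = "TIBETAN VOWEL SIGN "


def unsubjoin(ch):
    """Return the regular letter for a Tibetan subjoined letter, if one exists."""
    name = unicodedata.name(ch, "")
    if not name.startswith(SUBJOINED_PREFIX):
        return ch
    try:
        # e.g. TIBETAN SUBJOINED LETTER RA -> TIBETAN LETTER RA
        return unicodedata.lookup(name.replace("SUBJOINED LETTER", "LETTER", 1))
    except KeyError:
        # No regular counterpart under that name; keep the subjoined form.
        return ch


def degrade_tibetan(text, keep_vowels=True):
    """Convert subjoined letters to regular forms and drop other Mn characters.

    Assumption: vowel signs are kept when keep_vowels is True.
    """
    out = []
    for ch in text:
        name = unicodedata.name(ch, "")
        if name.startswith(SUBJOINED_PREFIX):
            out.append(unsubjoin(ch))
        elif unicodedata.category(ch) == "Mn":
            if keep_vowels and name.startswith(VOWEL_PREFIX):
                out.append(ch)
            # Any other nonspacing mark is dropped.
        else:
            out.append(ch)
    return "".join(out)


sample = "\u0F42\u0FB2\u0FB1\u0F74"
with_vowels = degrade_tibetan(sample)
without_vowels = degrade_tibetan(sample, keep_vowels=False)
print([f"U+{ord(ch):04X}" for ch in with_vowels])
print([f"U+{ord(ch):04X}" for ch in without_vowels])
```

For the example sequence, the subjoined RA and YA become U+0F62 and U+0F61, so the consonants of the cluster survive even when the vowel sign is dropped.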