Hello! How about adding BOCU-1 compression for multilingual Unicode messages?
Here is the description: https://www.unicode.org/notes/tn6/
Here are a few examples:
Это было моё первое принятое сообщение... 👩👨🧑👧🧒👨🦱👸🤶👮♀️🕵️♀️
input: 143 UTF-8 bytes
output: 90 BOCU-1 bytes
BOCU-1/UTF-8: 0.629371
Заменил антенну на 5-ягу. Если кто через нее ходит, дайте обратную связь
input: 129 UTF-8 bytes
output: 79 BOCU-1 bytes
BOCU-1/UTF-8: 0.612403
Доброго утра всем! 17,5 ° C и солнце 📡 )))
input: 68 UTF-8 bytes
output: 51 BOCU-1 bytes
BOCU-1/UTF-8: 0.750000
Погодка КАЙФ
input: 23 UTF-8 bytes
output: 13 BOCU-1 bytes
BOCU-1/UTF-8: 0.565217
Первый рабочий день после длинных выходных
input: 79 UTF-8 bytes
output: 43 BOCU-1 bytes
BOCU-1/UTF-8: 0.544304
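
The ratio on the last line of each example is the BOCU-1 size divided by the UTF-8 size. Here is a minimal Python sketch that re-checks the UTF-8 sizes and ratios for two of the examples. The BOCU-1 sizes are copied from the figures above, not computed: as far as I know Python's standard library has no BOCU-1 codec. `utf8_len` and `ratio` are helper names made up for this illustration.

```python
# Hypothetical helpers for re-checking the figures quoted above.
# BOCU-1 sizes are taken from the post, not computed here.

def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


def ratio(bocu1_bytes: int, utf8_bytes: int) -> float:
    """Compressed size relative to the UTF-8 size (lower is better)."""
    return bocu1_bytes / utf8_bytes


# (message, BOCU-1 size reported above)
samples = [
    ("Погодка КАЙФ", 13),
    ("Первый рабочий день после длинных выходных", 43),
]

for message, bocu1_bytes in samples:
    n = utf8_len(message)
    print(f"{n} UTF-8 bytes, {bocu1_bytes} BOCU-1 bytes, "
          f"ratio {ratio(bocu1_bytes, n):.6f}")
# 23 UTF-8 bytes, 13 BOCU-1 bytes, ratio 0.565217
# 79 UTF-8 bytes, 43 BOCU-1 bytes, ratio 0.544304
```

Each Cyrillic letter costs 2 bytes in UTF-8, which is why the savings on these messages approach 50%.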
You can download and test the Win32 console implementation here: https://www.unicode.org/notes/tn6/bocu1.exe
Alternatively, there is the «UCF» encoding, which also addresses the size bloat caused by characters outside the a-z range: https://github.com/hyoo-ru/mam_mol/tree/master/charset/ucf