implement break_wide #625

Merged
merged 1 commit into from Jan 21, 2017

Contributor

ailin-nemui commented Jan 20, 2017

For a more pleasant display of mixed East Asian and Latin text.

implement break_wide
for more pleasant east asian mixed display

Contributor

ailin-nemui commented Jan 20, 2017

Example

off (current way)

╔════════════════════════════════════════════════════════════════════════════════╗ 
║10:31 -!- 你好你好你好你好你好你好你好你好阿宝ab                                ║ 
║          cd你好你好你好你好你好你好你好你好你好你好你好你好你好你好阿宝ab      ║ 
║          cd你好你好你好你好你好你好                                            ║ 
╚════════════════════════════════════════════════════════════════════════════════╝ 

on

╔════════════════════════════════════════════════════════════════════════════════╗ 
║10:31 -!- 你好你好你好你好你好你好你好你好阿宝ab cd你好你好你好你好你好你好你好 ║ 
║          你好你好你好你好你好你好你好阿宝ab cd你好你好你好你好你好你好         ║ 
╚════════════════════════════════════════════════════════════════════════════════╝ 

Contributor

ailin-nemui commented Jan 20, 2017

Member

LemonBoy commented Jan 20, 2017

I'm against adding a new switch. I'd just change the logic to set the break point after every character with a width >= 2 or a space (but iff the view is in utf8 mode).
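
To make the proposed rule concrete, here is a minimal Python sketch of a greedy wrapper that allows a break after a space and, optionally, after any wide character. It is not the irssi implementation: `cell_width`, `wrap`, and the `break_wide` parameter are hypothetical names, and "wide" is approximated as East Asian Width `F` or `W` from the standard `unicodedata` module. The flag stands in for either a user setting or the "view is in utf8 mode" condition.

```python
import unicodedata


def cell_width(ch: str) -> int:
    """Terminal columns used by one character (simplified: wide = 2, else 1)."""
    return 2 if unicodedata.east_asian_width(ch) in ("F", "W") else 1


def wrap(text: str, columns: int, break_wide: bool = True) -> list[str]:
    """Greedy wrap. A break is allowed after a space and, when break_wide
    is set, after any wide character."""
    lines, line, width = [], "", 0
    last_break = 0  # length of `line` up to the latest break opportunity
    for ch in text:
        w = cell_width(ch)
        if width + w > columns and line:
            if ch == " ":
                # The overflowing character is a space: break here, drop it.
                lines.append(line)
                line, width, last_break = "", 0, 0
                continue
            # Cut at the last opportunity, or mid-run if there is none.
            cut = last_break or len(line)
            lines.append(line[:cut].rstrip(" "))
            line = line[cut:]
            width = sum(cell_width(c) for c in line)
            last_break = 0
        line += ch
        width += w
        if ch == " " or (break_wide and w == 2):
            last_break = len(line)
    if line:
        lines.append(line)
    return lines
```

With an 8-column view, `wrap("ab 你好你好你好你好 cd", 8, break_wide=False)` needs four lines because the whole CJK run is pushed to a new line at the space, while `break_wide=True` fills the first line and needs three.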

Member

dequis commented Jan 20, 2017

Judging by https://en.wikipedia.org/wiki/Line_breaking_rules_in_East_Asian_languages CJK text wrapping tends to be "meh, do whatever you want, don't mess the punctuation up". So the patch is fine as a slight improvement.

The proper unicode line wrapping algo has a ton of edge cases, as usual: http://unicode.org/reports/tr14/

glib has:

https://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#g-unichar-break-type

https://developer.gnome.org/glib/stable/glib-Unicode-Manipulation.html#GUnicodeBreakType ?

They suggest using pango for text wrapping, but pango is mostly a gtk-related thing (people may not want that in servers) which usually assumes more flexible renderers than terminals. We could (in the future, not necessarily in this PR) implement some of the less ambiguous break types.
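
As a rough illustration of what "some of the less ambiguous break types" could look like, the following Python sketch allows breaks around wide characters but refuses to break after opening punctuation or before closing punctuation. It is only an approximation under stated assumptions: Python's standard library has no TR14 line break property, so Unicode general categories stand in for the OP, CL and EX classes (`Ps` for OP, `Pe` and wide `Po` for CL/EX). The function names are hypothetical.

```python
import unicodedata


def is_wide(ch: str) -> bool:
    """East Asian Width F (fullwidth) or W (wide)."""
    return unicodedata.east_asian_width(ch) in ("F", "W")


def can_break_between(prev: str, nxt: str) -> bool:
    """True if a line break is allowed between two adjacent characters."""
    # Assumption: general category Ps approximates OP (no break after).
    if unicodedata.category(prev) == "Ps":
        return False
    # Assumption: Pe, and wide Po, approximate CL/EX (no break before).
    if unicodedata.category(nxt) == "Pe" or (
        is_wide(nxt) and unicodedata.category(nxt) == "Po"
    ):
        return False
    if prev == " ":
        return True
    # ID-like behaviour: break before or after a wide character.
    return is_wide(prev) or is_wide(nxt)


def break_points(text: str) -> list[int]:
    """Indices i such that a break is allowed just before text[i]."""
    return [i for i in range(1, len(text)) if can_break_between(text[i - 1], text[i])]
```

For `"「你好」他说。ok"` this yields break points that keep `「` attached to the following character and `」` and `。` attached to the preceding one.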

I have no strong opinion regarding the switch. Dropping the utf8 condition seems suspicious though.

Contributor

ailin-nemui commented Jan 20, 2017

It looks like pango depends on X, even though the tr14 algorithm implemented in pango_break does not. The algorithm is rendering-agnostic; it only tells you which points are valid break points. However, it would mess with URLs, for instance.

Contributor

ailin-nemui commented Jan 20, 2017

The actual reason for the initial conditional on utf8 was to enable the break-wide function for the "Big5" terminal type, a 2-byte East Asian encoding which is ASCII-compatible.
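
For context, a Big5 byte stream can be split into one-column ASCII bytes and two-column double-byte characters by looking at the lead byte. The Python sketch below is an illustration only, not how irssi handles Big5: `split_big5` is a hypothetical helper, and it assumes every byte at or above `0x81` starts a two-byte character.

```python
def split_big5(data: bytes) -> list[bytes]:
    """Split Big5 bytes into characters.

    Assumption: a byte >= 0x81 is a lead byte followed by one trail byte;
    anything below that is a single ASCII byte. The trail byte may itself
    fall in the ASCII range, so the stream must be scanned from the start.
    """
    units, i = [], 0
    while i < len(data):
        step = 2 if data[i] >= 0x81 and i + 1 < len(data) else 1
        units.append(data[i:i + step])
        i += step
    return units


data = "你好ab".encode("big5")
units = split_big5(data)
# Each 2-byte unit is a wide character and a candidate break point.
widths = [len(u) for u in units]
```

Here the byte length of each unit doubles as its column width, which is what makes a break-after-wide rule straightforward for this encoding.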

Contributor

ailin-nemui commented Jan 20, 2017

Are there any use cases where breaking wide characters may be undesirable? My guess would be yes. I suppose we would need to look at non-East-Asian wide characters to decide that.

Member

dequis commented Jan 20, 2017

Somewhat relevant thing I was reading yesterday http://www.loekalization.com/mistakes.html section 8b

By the way, there's a reason why especially Japanese developers love hardcoding [newlines in] their strings: in Japan it's customary to wrap text by hand (Japanesehasnospaces, and it wouldn't be user-friendly to have the text automatically wrapped in the middle of words).

So as far as I understand the thing, the machine-wrapping algorithms are lazy because it doesn't really matter a lot, but if you can control the wrapping manually it's desirable to insert breaks in cleverer places.

Member

dequis commented Jan 21, 2017

Characters with east asian width F or W ranked by line break class

| Count | Class | Description | Examples |
|---|---|---|---|
| 171828 | ID - Ideographic | Break before or after, except in some numeric context | 𤎏 (U+2438F), 抴 (U+62B4), ꏌ (U+A3CC), 𦩕 (U+26A55), 𥮦 (U+25BA6) |
| 10773 | H3 - Hangul LVT Syllable | Form Korean syllable blocks | 츷 (U+CE37), 흯 (U+D76F), 몙 (U+BA99), 괫 (U+AD2B), 줒 (U+C912) |
| 5782 | AL - Alphabetic | May not break | 🔞 (U+1F51E), 𑣖 (U+118D6), 𐚐 (U+10690), ᯌ (U+1BCC), 𖥀 (U+16940) |
| 662 | CM - Combining Mark | No break | 𖽙 (U+16F59), 𑍌 (U+1134C), 𝩕 (U+1DA55), 𝨘 (U+1DA18), 𑆶 (U+111B6) |
| 399 | H2 - Hangul LV Syllable | Form Korean syllable blocks | 쁴 (U+C074), 뜨 (U+B728), 쀄 (U+C004), 기 (U+AE30), 떠 (U+B5A0) |
| 170 | NU - Numeric | Form numeric expressions for line breaking purposes | 𑑔 (U+11454), 𑋲 (U+112F2), 𖩣 (U+16A63), 𑇘 (U+111D8), 𑛈 (U+116C8) |
| 125 | JL - Hangul L Jamo | Form Korean syllable blocks | ᄫ (U+112B), ᅝ (U+115D), ᄂ (U+1102), ꥶ (U+A976), ꥭ (U+A96D) |
| 100 | AI - Ambiguous | Act like AL when the resolved EAW is N; otherwise, act as ID | 🅤 (U+1F164), 🄵 (U+1F135), 🅉 (U+1F149), 🅑 (U+1F151), 🆜 (U+1F19C) |
| 81 | BA - Break After | May break after | 𑈹 (U+11239), 𖩮 (U+16A6E), 𑈸 (U+11238), 𑗍 (U+115CD), 𑗊 (U+115CA) |
| 79 | EB - Emoji Base | Do not break from following Emoji Modifier | 👵 (U+1F475), 🏋 (U+1F3CB), 💁 (U+1F481), 🤼 (U+1F93C), 👈 (U+1F448) |
| 71 | SA - Complex Context Dependent | May break (req language-specific context analysis) | 𑜌 (U+1170C), 𑜺 (U+1173A), ꩾ (U+AA7E), ꩽ (U+AA7D), 𑜁 (U+11701) |
| 55 | JT - Hangul T Jamo | Form Korean syllable blocks | ퟟ (U+D7DF), ퟠ (U+D7E0), ퟏ (U+D7CF), ퟕ (U+D7D5), ퟵ (U+D7F5) |
| 41 | CJ - Conditional Japanese Starter | Treat as NS or ID for strict or normal breaking. | ぅ (U+3045), ぇ (U+3047), ㇼ (U+31FC), ォ (U+30A9), ッ (U+30C3) |
| 38 | CL - Close Punctuation | No break before | ﹄ (U+FE44), ﹀ (U+FE40), ︒ (U+FE12), ︾ (U+FE3E), ︶ (U+FE36) |
| 32 | OP - Open Punctuation | No break after | 《 (U+300A), [ (U+FF3B), ︻ (U+FE3B), 『 (U+300E), ﹝ (U+FE5D) |
| 28 | JV - Hangul V Jamo | Form Korean syllable blocks | ퟆ (U+D7C6), ᆧ (U+11A7), ힺ (U+D7BA), ᆦ (U+11A6), ힿ (U+D7BF) |
| 27 | PR - Prefix Numeric | Do not break in front of a numeric expression | ⃄ (U+20C4), ⃁ (U+20C1), ֏ (U+058F), £ (U+FFE1), ⃋ (U+20CB) |
| 26 | RI - Regional Indicator | Keep pairs together. | 🇸 (U+1F1F8), 🇹 (U+1F1F9), 🇰 (U+1F1F0), 🇲 (U+1F1F2), 🇴 (U+1F1F4) |
| 21 | NS - Nonstarter | Allow only indirect line breaks before | ﹔ (U+FE54), 々 (U+3005), ゝ (U+309D), ゜ (U+309C), 🙻 (U+1F67B) |
| 18 | BB - Break Before | May break before | 𑙧 (U+11667), 𑙨 (U+11668), 𑙡 (U+11661), 𑙦 (U+11666), 𑙠 (U+11660) |
| 9 | EX - Exclamation/Interrogation | No break before | ︖ (U+FE16), 𑗅 (U+115C5), 𑗄 (U+115C4), 𑱱 (U+11C71), ﹖ (U+FE56) |
| 5 | PO - Postfix Numeric | Do not break following a numeric expression | ₻ (U+20BB), ₾ (U+20BE), ﹪ (U+FE6A), % (U+FF05), ¢ (U+FFE0) |
| 5 | EM - Emoji Modifier | Do not break from preceding Emoji Base | 🏻 (U+1F3FB), 🏼 (U+1F3FC), 🏽 (U+1F3FD), 🏾 (U+1F3FE), 🏿 (U+1F3FF) |
| 5 | QU - Quotation | Act like they are both opening and closing | ❟ (U+275F), ❠ (U+2760), 🙶 (U+1F676), 🙷 (U+1F677), 🙸 (U+1F678) |
| 3 | IS - Infix Numeric Separator | Prevent breaks after any and before numeric | ︐ (U+FE10), ︓ (U+FE13), ︔ (U+FE14) |
| 2 | IN - Inseparable | Allow only indirect line breaks between pairs | ︙ (U+FE19), 𐫶 (U+10AF6) |
| 2 | GL - Non-breaking ("Glue") | No break | ࿙ (U+0FD9), ࿚ (U+0FDA) |
| 2 | B2 - Break Opportunity Before and After | May break | ⸺ (U+2E3A), ⸻ (U+2E3B) |

(source code)

I simplified the wording of some items in the description column, so it might not be accurate. AL was extremely vague, but that one is easy to guess. SA is fun.
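
For readers who want to explore the same set of characters, here is a rough Python sketch that selects every code point with East Asian Width F or W. It is not the script linked above. Python's standard library does not expose the line break class, so this stand-in groups by Unicode general category instead, and the counts depend on the Unicode version bundled with the interpreter.

```python
import sys
import unicodedata
from collections import Counter


def count_wide_by_category() -> Counter:
    """Count code points with East Asian Width F or W, per general category.

    Assumption: general category is used only as a stand-in, because the
    TR14 line break class is not available in the standard library.
    """
    counts = Counter()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.east_asian_width(ch) in ("F", "W"):
            counts[unicodedata.category(ch)] += 1
    return counts


counts = count_wide_by_category()
for category, n in counts.most_common(5):
    print(n, category)
```

The largest bucket is expected to be `Lo` (other letters), which covers both the ideographs and the precomposed Hangul syllables discussed here.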

Hangul syllables seem like a great reason to keep the setting. That's pretty much my conclusion to all of this.

I keep thinking it might be fun to write a very basic implementation of the algo covering a subset of these classes, but in the context of this PR, yeah, ship it.

ailin-nemui merged commit 228f487 into irssi:master Jan 21, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

Contributor

ailin-nemui commented Jan 21, 2017

Wow, such extensive research. Thanks dx

ailin-nemui deleted the ailin-nemui:chirssi branch Jan 22, 2017
