Unicode support for `breakat` so that linebreak works with unicode languages (Japanese, Chinese etc) #13967

1dancook · 2021-02-18T04:47:25Z

I wonder if unicode support could be added for breakat.

I write in Japanese, and `linebreak` doesn't work for it. In Markdown, for example, you end up with some odd results. The lines do wrap, but for something like a blockquote you get the following for a line (it breaks at the space):

> 
日本語とか。。。

:help breakat states the following --

'breakat' 'brk'		string	(default " ^I!@*-+;:,./?")
			        global
	This option lets you choose which characters might cause a line
	break if 'linebreak' is on.  Only works for ASCII characters.

I don't know how linebreak works internally, but I surmise that the following logic may be enough:

if the character at the last column is any unicode character:
    break the line
otherwise if the character at the last column is in the breakat string:
    break the line
for any other case:
    don't break the line

If it's possible to implement that kind of logic, I believe it would work fine for Japanese and Chinese. Text in those languages is literally just one long, unbroken run of Unicode characters. Perhaps other languages are similar, but I'm not sure about that.
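The rule proposed above can be sketched in Python. This is only an illustration of the logic, not how Vim implements `linebreak`: `can_break_after` and `DEFAULT_BREAKAT` are hypothetical names, and "any unicode character" is assumed here to mean any code point outside the ASCII range.

```python
# Default 'breakat' value from :help breakat (" ^I!@*-+;:,./?"), with ^I written as a tab.
DEFAULT_BREAKAT = " \t!@*-+;:,./?"


def can_break_after(ch: str, breakat: str = DEFAULT_BREAKAT) -> bool:
    """Return True if the proposed rule would allow a line break after ch."""
    # Assumption: "unicode character" means any non-ASCII code point.
    if ord(ch) > 0x7F:
        return True
    # Otherwise fall back to the existing ASCII-only 'breakat' check.
    return ch in breakat
```

Under this sketch, `can_break_after("日")` and `can_break_after(" ")` are both true, while `can_break_after("a")` is false.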

Thanks in advance!


worldeva · 2021-02-22T16:03:35Z

Breaking on any Unicode character wouldn't work, because then lächeln would break at the ä, etc.

The only way to handle this that I see is to check for the unicode group and break at certain groups. Perhaps give a setting for which ones?

This gives up some of the control that the breakat setting offers for Unicode characters, but it would be a large performance improvement. The other methods are to linearly search for the character in the string, to initialize and fill a 65,536-entry table, or to use a hash table. Exact control over which characters break is rarely used anyway.
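For comparison, the three per-character lookup methods mentioned above might look roughly like this in Python. It is a sketch only: all function names are hypothetical, and the 65,536-entry table is assumed to cover code points U+0000 through U+FFFF, so anything above that range is simply treated as not in the set.

```python
def in_breakat_linear(ch: str, breakat: str) -> bool:
    """Linear search through the option string; cost grows with its length."""
    return ch in breakat


def build_table(breakat: str) -> bytearray:
    """Fill a 65,536-entry table indexed by code point (U+0000..U+FFFF only)."""
    table = bytearray(0x10000)
    for ch in breakat:
        if ord(ch) < len(table):
            table[ord(ch)] = 1
    return table


def in_breakat_table(ch: str, table: bytearray) -> bool:
    """Constant-time lookup; code points beyond the table never match."""
    cp = ord(ch)
    return cp < len(table) and table[cp] == 1


def in_breakat_hashed(ch: str, breakat_set: frozenset) -> bool:
    """Hash-based lookup; memory scales with the number of listed characters."""
    return ch in breakat_set
```

All three give the same answer for characters inside the table's range; they differ only in setup cost, memory, and lookup time.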

1dancook · 2021-02-23T07:08:23Z

@worldeva I see what you mean.

After checking, I realized that there are basically no languages (not programming 😝) that use strictly ASCII characters. English also uses Unicode characters.

For reference, I believe you are referring to unicode blocks as defined like this?

I think what you suggest is a good idea:

1 - define which unicode blocks should break
2 - break the line if the character at the last column is in a defined block from (1)
3 - otherwise fallback to the default ASCII behavior of breakat
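The three steps above could be sketched as follows. This is an illustration under stated assumptions, not an actual Vim option or API: `BREAK_BLOCKS` and `can_break_after_blocks` are hypothetical names, and the four blocks listed are just one plausible choice for Japanese and Chinese text.

```python
# Default 'breakat' value from :help breakat, with ^I written as a tab.
DEFAULT_BREAKAT = " \t!@*-+;:,./?"

# Step 1: Unicode blocks that should allow a break (an assumed selection).
BREAK_BLOCKS = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]


def can_break_after_blocks(ch, breakat=DEFAULT_BREAKAT, blocks=BREAK_BLOCKS):
    """Return True if a line break is allowed after ch."""
    cp = ord(ch)
    # Step 2: break if the character falls inside one of the chosen blocks.
    if any(lo <= cp <= hi for lo, hi in blocks):
        return True
    # Step 3: otherwise fall back to the ASCII behavior of 'breakat'.
    return ch in breakat
```

With these blocks, `日` and `。` allow a break while `ä` does not, so a word like lächeln stays intact.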

I don't think there are a lot of blocks (read: languages) that need that kind of special treatment. Languages that use mostly Latin lettering will, I believe, use the normal ASCII spaces etc., and the default way Vim wraps and breaks seems to work fine for them.

With regard to performance, I think it should be off by default and enabled by a setting. I'm sure the majority of vim/nvim users do not need this. I wouldn't have it on for a Python file, but I would turn it on for Markdown, for example.

worldeva · 2021-03-06T17:23:36Z

Well, I got halfway through the current PR before realizing that an actual paper from the Unicode Consortium on Unicode line breaking exists.
So I'm going to implement that first, which should mean much better line breaking in general for any language (linebreak, formatting, etc.), before coming back and seeing whether the feature in #14029 is still necessary.

1dancook · 2021-03-08T14:30:22Z

That looks like a lot to sort through, thank you!

1dancook added the enhancement feature request label Feb 18, 2021

worldeva mentioned this issue Feb 27, 2021

[WIP] Add unicode support for linebreak and textwidth #14029

Closed

5 tasks
