Unicode support for breakat so that linebreak works with unicode languages (Japanese, Chinese etc) #13967

Open
1dancook opened this issue Feb 18, 2021 · 4 comments
Labels
enhancement feature request

Comments

@1dancook
I wonder if unicode support could be added for breakat.

Personally I write in Japanese (unicode), and linebreak doesn't work for it. In something like markdown you end up with odd results. The lines do wrap, but in a blockquote, for example, a line comes out like this, because the only place it can break is at the space after the quote marker:

> 
日本語とか。。。

:help breakat states the following --

'breakat' 'brk'		string	(default " ^I!@*-+;:,./?")
			        global
	This option lets you choose which characters might cause a line
	break if 'linebreak' is on.  Only works for ASCII characters.

I don't know how linebreak works internally, but I surmise the following logic may be enough:

if the character at the last column is any unicode character:
    break the line
otherwise if the character at the last column is in the breakat string:
    break the line
for any other case:
    don't break the line

If it's possible to implement that kind of logic, in the case of Japanese and Chinese I believe this would work fine. It's literally just a long line of non-stop unicode. Perhaps other languages are similar but I'm not sure about that.

Thanks in advance!

@1dancook 1dancook added the enhancement feature request label Feb 18, 2021
@worldeva

worldeva commented Feb 22, 2021

Breaking for any unicode character wouldn't work, because then lächeln would break on ä, etc.
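To make the objection concrete, here is a minimal sketch of the rule proposed above, assuming "any unicode character" means "any non-ASCII code point". The function name naive_can_break is hypothetical and this is not how Vim implements linebreak; it only shows that the rule flags the ä in the middle of a word.

```python
# Default 'breakat' value from :help breakat (^I is a tab).
BREAKAT = " \t!@*-+;:,./?"


def naive_can_break(ch: str) -> bool:
    """Proposed rule: break at any non-ASCII character or any breakat character."""
    return ord(ch) > 0x7F or ch in BREAKAT


word = "l\u00e4cheln"  # "lächeln" with a precomposed ä (U+00E4)
flags = [naive_can_break(c) for c in word]
# Only the second character (ä) is flagged, so the word could be split
# in the middle.
```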

The only way to handle this that I see is to check for the unicode group and break at certain groups. Perhaps give a setting for which ones?

This gives up some of the control the breakat setting offers for unicode characters, but it would be a large performance improvement over the other methods: linearly searching the breakat string for the character, initializing and filling a 65,536-entry table, or using a hashtable. Exact control over which characters break is rarely used anyway.
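As a rough illustration of the group-based check, the sketch below tests whether a code point falls inside one of a few sorted ranges using a binary search, instead of searching a string or filling a table. The choice of blocks is an assumption made for this example (the ranges themselves are the standard Unicode block boundaries), and BREAK_BLOCKS and in_break_block are hypothetical names, not Vim internals.

```python
from bisect import bisect_right

# Assumed selection of blocks that should allow a break; sorted by start.
BREAK_BLOCKS = [
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
]
_STARTS = [lo for lo, _ in BREAK_BLOCKS]


def in_break_block(ch: str) -> bool:
    """Return True if ch lies in one of BREAK_BLOCKS (binary search on range starts)."""
    cp = ord(ch)
    i = bisect_right(_STARTS, cp) - 1  # last range starting at or before cp
    return i >= 0 and cp <= BREAK_BLOCKS[i][1]
```

With only a handful of ranges a plain loop would be just as fast; the binary search only matters if the list of groups grows.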

@1dancook
Author

@worldeva I see what you mean.

After checking, I realized that there are basically no languages (not programming 😝 ) that use strictly ASCII characters. Even English uses unicode characters.

For reference, I believe you are referring to unicode blocks as defined like this?

I think your idea is a good one:

1 - define which unicode blocks should break
2 - break the line if the character at the last column is in a defined block from (1)
3 - otherwise fallback to the default ASCII behavior of breakat
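The three steps above could be sketched roughly as follows. This is only an illustration under stated assumptions: the block list is an example selection, can_break_after and break_index are hypothetical names, and the width is counted in characters rather than display cells (CJK characters usually take two cells in a terminal, which a real implementation would have to account for).

```python
# Default 'breakat' value from :help breakat (^I is a tab).
DEFAULT_BREAKAT = " \t!@*-+;:,./?"

# Step 1: blocks that should break (assumed example selection).
CJK_BLOCKS = (
    (0x3000, 0x303F),  # CJK Symbols and Punctuation
    (0x3040, 0x309F),  # Hiragana
    (0x30A0, 0x30FF),  # Katakana
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
)


def can_break_after(ch: str, breakat: str = DEFAULT_BREAKAT, blocks=CJK_BLOCKS) -> bool:
    cp = ord(ch)
    # Step 2: break if the character is in one of the defined blocks.
    if any(lo <= cp <= hi for lo, hi in blocks):
        return True
    # Step 3: otherwise fall back to the ASCII breakat behavior.
    return ch in breakat


def break_index(line: str, width: int) -> int:
    """Index at which to split line so the first part has at most width characters."""
    if len(line) <= width:
        return len(line)
    # Walk back from the last column to the nearest allowed break point.
    for i in range(width - 1, -1, -1):
        if can_break_after(line[i]):
            return i + 1
    return width  # no break opportunity: wrap at the last column


quote = "> \u65e5\u672c\u8a9e\u3068\u304b"  # "> 日本語とか"
first_part = quote[:break_index(quote, 4)]
```

Here the blockquote line is split after the second Japanese character instead of at the space, while a word like lächeln is still only broken at ASCII breakat characters.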

I don't think there are a lot of blocks (read: languages) that need that kind of special treatment. For languages that use mostly Latin lettering, I believe they will use the normal ASCII spaces etc., and the default way vim wraps and breaks seems to work fine.

With regard to performance, I think it should be off by default and enabled by a setting. I'm sure a majority of vim/nvim users do not need this. I wouldn't have this on for a python file but I would turn it on for markdown, for example.

@worldeva

worldeva commented Mar 6, 2021

Well, I got halfway through the current PR before realizing that an actual paper from the Unicode Consortium on unicode line breaking exists.
So I'm going to implement that, which should mean much better line breaking in general for any language (linebreak, formatting, etc.), before coming back and seeing if the feature in #14029 is still necessary.

@1dancook
Author

1dancook commented Mar 8, 2021

That looks like a lot to sort through, thank you!
