Unicode BOM (U+FEFF) is not displayed in view whitespaces mode #120

Closed
Sturwandan opened this issue Jul 11, 2021 · 11 comments


@Sturwandan

Sturwandan commented Jul 11, 2021

The byte order mark (BOM) is a pseudo-character that some applications use to detect the byte order of UTF-16 and UTF-32 encodings. Unlike those two, UTF-8 has no byte order issues whatsoever, yet some text editors and IDEs erroneously insert a BOM into UTF-8 files anyway, breaking the main feature we love UTF-8 for: full ASCII compatibility.
This in turn led some other programs to adopt an incorrect use of the UTF-8 BOM as well, namely to tell UTF-8 apart from obsolete 8-bit encodings. That is also wrong, because UTF-8 has a regular structure that is easy to detect. Meanwhile, most other programs handle plain valid UTF-8 but fail on the BOM-prefixed UTF-8 files produced by the former programs, which I'd say is not a bug on their part.
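
As a rough illustration of the claim that UTF-8 can be recognized without a BOM, the sketch below treats a file as UTF-8 if it decodes cleanly. It assumes an `iconv` command is installed. The function name `is_valid_utf8` is made up for this example, and a clean decode is only a heuristic, since pure ASCII files pass too.

```shell
# is_valid_utf8 FILE: succeed if FILE decodes as UTF-8 without errors.
# Hypothetical helper; it relies on iconv rejecting malformed byte sequences.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

Typical 8-bit encoded text with non-ASCII characters tends to fail this check, because a byte such as 0xE9 followed by an ASCII character is not a valid UTF-8 sequence.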

This character's former meaning was "zero width no-break space", but Unicode now provides a different dedicated code point for that purpose (U+2060 WORD JOINER), leaving U+FEFF to serve as the BOM.

Text editors deal with the BOM in different ways: some strip it upon saving, some add it, some leave it alone, and some have a setting that controls the behavior. From my testing, Textadept takes an "I don't care" approach: the BOM is treated as an arbitrary character and (correctly?) displayed as a zero-width space, which manifests as follows:

The text looks the same as it would without the BOM. However, if you put the cursor at the beginning of the first line and hit the DEL key, the first visible character is unaffected while the BOM is removed, and after saving the file no longer contains it. This behavior is not unexpected in regular mode, but after selecting the menu option "Toggle View Whitespace" the BOM is still not visible, unlike tabs and regular spaces.

While this behavior is acceptable in regular mode, I think Textadept should show the BOM, for example as [BOM] (inverted text), similar to [SOH] and [NUL], at the very least in View Whitespace mode, because the user has to be able to deal with it from within the text editor itself.

That said, I think it should be displayed even in regular mode by default, so that you can see your UTF-8 file has a BOM as soon as you open it, which makes this sort of problem easier to deal with. But that's at your discretion.

{ echo -en '\xef\xbb\xbf'; cat foo.txt; } > bar.txt

The command above adds an erroneous UTF-8 BOM to the contents of foo.txt and writes the result to bar.txt. The BOM can then be stripped from the file in Textadept using the method described above. My only suggestion is to make the BOM easier to strip (and, in the rare cases when you have to deal with buggy software, to add) by displaying it.
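
For cases where the file has to be handled outside the editor, a command-line check and strip might look like the sketch below. The function names `has_utf8_bom` and `strip_utf8_bom` are invented for this example, and it assumes `head -c` is available, which is common but not guaranteed on every system.

```shell
# has_utf8_bom FILE: succeed if FILE starts with the bytes EF BB BF.
has_utf8_bom() {
  # Dump the first three bytes as hex and drop whitespace before comparing.
  [ "$(head -c 3 "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}

# strip_utf8_bom SRC DST: copy SRC to DST, dropping a leading BOM if present.
strip_utf8_bom() {
  if has_utf8_bom "$1"; then
    tail -c +4 "$1" > "$2"   # start output at byte 4, skipping the 3 BOM bytes
  else
    cat "$1" > "$2"
  fi
}
```

With the files from the command above, `has_utf8_bom bar.txt` should succeed, and `strip_utf8_bom bar.txt clean.txt` should produce a file identical to foo.txt.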

@orbitalquark
Owner

orbitalquark commented Jul 12, 2021 via email

@Sturwandan
Author

Sturwandan commented Jul 13, 2021

Thank you for the suggestion. The line seems to kind of work, but I ran into several problems with it.
They might be because I haven't updated Textadept since I installed it last year, so perhaps I'm running into bugs that are already fixed:

  1. The [BOM] mark disappeared from Textadept when I switched to a different tab and back. I re-checked the file, and the BOM was still in there.
  2. I decided to do a similar thing with CR, because it doesn't really work as a line terminator on UNIX systems. As far as I know, only old Macs used a lone CR as a line terminator, and I have never seen such a text file, so I'd rather see CR as a black box, like other control characters, than as a line terminator:
     view.representation['\x0d'] = 'CR'
     This one didn't work at all.
  3. When I use reload from the tab context menu, the [BOM] mark also disappears, while the file still has it.

Anyway, the reason I filed this as a bug report instead of searching the documentation is that I believe it would be universally useful for users to have this feature available by default.

In general, I wanted the "view whitespace" feature to work with all kinds of whitespace characters, not just tabs and spaces.

@orbitalquark
Owner

orbitalquark commented Jul 13, 2021 via email

@Sturwandan
Author

Sturwandan commented Jul 13, 2021

There is a separate setting for viewing EOL characters like CR: view.view_eol = true

This setting would show both CR and LF, but my idea was to use LF as the line terminator while displaying CR as a regular character, similar to what less does. If LF were the one and only thing that terminates a line (so that a CR in the middle of a line would not break it), then regular well-formed ASCII/UTF-8 files would look fine, while files with a BOM or CR-LF line endings would be instantly recognizable.
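
Outside the editor, a similar view can be approximated by replacing each CR with a visible marker while leaving LF as the only line terminator. This is only a sketch: `show_cr` is a made-up helper name and the `[CR]` marker is an arbitrary choice.

```shell
# show_cr [FILE...]: print the input with every CR byte replaced by "[CR]".
# LF still ends lines as usual, so clean LF-only text passes through unchanged.
# The body runs in a subshell so the helper variable does not leak.
show_cr() (
  cr=$(printf '\r')          # a literal carriage return
  sed "s/${cr}/[CR]/g" "$@"
)
```

Running `show_cr file.txt` on a CR-LF file would then show `[CR]` at the end of every line, while an LF-only file prints as is.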

@orbitalquark
Owner

orbitalquark commented Jul 13, 2021 via email

@Sturwandan
Author

Sturwandan commented Jul 14, 2021

OK, this code in my init.lua seems to produce the desired effect. I'll need to test it a bit more to make sure it has no other pitfalls, though:

-- Make Textadept display the BOM with a visible mark
view.representation['\xef\xbb\xbf'] = 'BOM'
-- Reapply the mark after every buffer switch, since it disappeared otherwise
events.connect(events.BUFFER_AFTER_SWITCH, function()
  view.representation['\xef\xbb\xbf'] = 'BOM'
end)

Also, after some thinking, I'd change my proposal regarding line terminators: make them show up only when they differ from the intended ones. For example, in LF mode every CR shows up as a black box, including one at the end of a line. In CRLF mode, a lone CR, a lone LF, and an LF CR sequence (except as part of CR LF CR LF) all show up as black boxes, while CRLF terminators remain invisible and manifest themselves only as line breaks.
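
To get a feel for how many terminators in a file would end up as black boxes under this proposal, the different kinds can be counted from the command line. The sketch below is an approximation under one stated assumption: the file ends with an LF, so every CR at the end of a line belongs to a CR LF pair. The name `eol_report` is invented for this example.

```shell
# eol_report FILE: print counts of CRLF pairs, lone LFs and lone CRs.
# Assumes FILE ends with an LF; runs in a subshell to keep variables local.
eol_report() (
  cr=$(printf '\r')
  total_lf=$(tr -cd '\n' < "$1" | wc -c | tr -d ' ')
  total_cr=$(tr -cd '\r' < "$1" | wc -c | tr -d ' ')
  # Lines ending in CR are CRLF-terminated; grep -c exits 1 when the count is 0.
  crlf=$(grep -c "${cr}\$" "$1" || true)
  echo "crlf=${crlf} lone_lf=$((total_lf - crlf)) lone_cr=$((total_cr - crlf))"
)
```

In LF mode, the `crlf` and `lone_cr` counts together give the number of CRs that would be shown. In CRLF mode, the `lone_lf` and `lone_cr` counts would be the visible ones.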

Next idea: make whitespace at the ends of lines always show up, while hiding whitespace in the middle. Do you think I need to open a new bug report for that? It's what, say, KWrite does, and it's convenient because you can catch bogus whitespace before line breaks without switching modes and without sacrificing aesthetics.
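
Until an editor shows it, trailing whitespace can be located from the command line. A minimal sketch, with `trailing_ws` as a hypothetical helper name; it only looks for spaces and tabs directly before the LF, so whitespace followed by a CR in a CR-LF file is not reported.

```shell
# trailing_ws FILE: print the numbers of lines that end in a space or tab.
trailing_ws() {
  grep -n '[[:blank:]]$' "$1" | cut -d: -f1
}
```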

It would also be a good idea to make whitespace at the beginning of a line show up if it differs from the current indentation setting. For example, say I have indentation set to tab (HT) characters displayed 4 columns wide, and in a certain file most lines are indented with tabs, as they are supposed to be, but one line uses spaces instead. Then those spaces would be visible, and vice versa.
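
The same kind of check can be sketched on the command line: given the intended indentation character, list the lines whose leading whitespace contains the other one. The helper name `bad_indent` and its `tabs`/`spaces` argument are assumptions made for this example, not an existing tool.

```shell
# bad_indent MODE FILE: print the numbers of lines whose leading whitespace
# contains the wrong character. MODE is "tabs" or "spaces".
bad_indent() (
  tab=$(printf '\t')
  case "$1" in
    tabs)   pattern="^${tab}* " ;;   # optional tabs, then a space
    spaces) pattern="^ *${tab}" ;;   # optional spaces, then a tab
    *)      echo "usage: bad_indent tabs|spaces FILE" >&2; exit 2 ;;
  esac
  grep -n "$pattern" "$2" | cut -d: -f1
)
```

For a file that is supposed to be tab-indented, `bad_indent tabs file.txt` lists the lines that would show visible spaces under this proposal.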

My idea in the big picture: make the text in the editor correspond to its bytes on disk as closely as possible without sacrificing aesthetics and readability. That means I can't just accept black [LF] boxes sticking out at the end of every line, or black dots between every two words. But if regular line terminators and whitespace were invisible while bogus ones were visible, this could be achieved.

@orbitalquark
Owner

orbitalquark commented Jul 16, 2021 via email

@Sturwandan
Author

@orbitalquark I see...
Regarding line terminators, though: does the lack of a "CR only" mode come from Scintilla or from Textadept?
I have read that Macs used this line terminator, but I don't know how it is handled in modern Mac OS X.

Do you think Textadept should have a CR mode or not? Personally, I have never seen such files in the wild.

@orbitalquark
Owner

orbitalquark commented Jul 17, 2021 via email

@Sturwandan
Author

Not quite, I didn't ask this question before. Anyway, my proposal would be: show line breaks that are inconsistent with the current settings as black boxes.

@orbitalquark
Owner

orbitalquark commented Mar 17, 2022 via email
