Unicode BOM (U+FEFF) is not displayed in view whitespaces mode #120

Sturwandan · 2021-07-11T22:20:52Z

The Byte-Order-Mark is a pseudo-character used by some applications to detect the byte order of the UTF-16 and UTF-32 encodings. Unlike those two, UTF-8 has no byte-order problems whatsoever; however, some text editors and IDEs erroneously insert a BOM into UTF-8 files anyway, breaking the main feature we love UTF-8 for: full ASCII compatibility.
Nevertheless, this made some other programs adopt an incorrect usage of the UTF-8 BOM as well, namely to tell UTF-8 apart from obsolete 8-bit encodings. That is also wrong, because UTF-8 has a regular structure that is easy to detect. Meanwhile, most other programs handle valid UTF-8 fine but fail on the BOM-prefixed UTF-8 files produced by the former programs, which I'd say is not a bug.
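
Whether a given file carries the BOM can be checked from the command line. The following is a minimal sketch that tests for the three bytes EF BB BF (the UTF-8 encoding of U+FEFF) at the start of a file. The function name `has_utf8_bom` is made up for this example, and the sketch assumes a `head` that supports `-c`, as the GNU and BSD versions do.

```shell
# has_utf8_bom FILE (hypothetical helper name, not part of any tool mentioned here)
# Exit status 0 if FILE starts with the UTF-8 BOM (EF BB BF), non-zero otherwise.
has_utf8_bom() {
    # Dump the first three bytes as hex and strip od's spacing before comparing.
    [ "$(head -c 3 < "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]
}
```

It could be used as `has_utf8_bom bar.txt && echo "bar.txt has a BOM"`.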

The former meaning of this character was "zero width no-break space", though Unicode now has a different dedicated code point for that role (U+2060 WORD JOINER), reserving U+FEFF as a BOM only.

Other text editors deal with the BOM in different ways: some strip it upon saving, some add it, some don't touch it, and some have a setting that controls the behavior. From my testing, Textadept takes an "I don't care" approach, meaning the BOM is treated as an arbitrary character and (correctly?) displayed as a zero-width space, which manifests as follows:

The text looks the same as it does without a BOM. However, if you put the cursor at the beginning of the first line and hit the DEL key, the first visible character is unaffected but the BOM is removed, and after saving, the file no longer contains it. This behavior is not unexpected in regular mode; however, after selecting the menu option "Toggle View Whitespace", the BOM is still not visible, unlike tabs and regular spaces.

While this behavior is acceptable for regular mode, I think that Textadept should show the BOM, for example as [BOM] (inverted text), similar to [SOH] and [NUL], at the very least in View Whitespace mode, because the user has to be able to deal with it from within the text editor itself.

That said, I think it should be displayed even in regular mode by default, to make this sort of problem easier to deal with: you would see that your UTF-8 file has a BOM as soon as you open it. But that's at your discretion.

{ echo -en '\xef\xbb\xbf'; cat foo.txt; } > bar.txt

The command above can be used to add an erroneous UTF-8 BOM to foo.txt and write the result to bar.txt. The BOM can then be stripped from the file in Textadept using the method described above. My only suggestion is to make it easier to strip the BOM (and, in the rare cases when you have to deal with buggy software, to add it) by displaying it.
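
For completeness, the reverse operation can also be done outside the editor. This is a minimal sketch, assuming a POSIX shell and a `head` that supports `-c`; the function name `strip_utf8_bom` is hypothetical.

```shell
# strip_utf8_bom IN OUT (hypothetical helper name)
# Copy IN to OUT, dropping a leading UTF-8 BOM (EF BB BF) if one is present.
# IN and OUT must be different files, because OUT is truncated before IN is read.
strip_utf8_bom() {
    if [ "$(head -c 3 < "$1" | od -An -tx1 | tr -d ' \n')" = "efbbbf" ]; then
        # Start output at byte 4, skipping the three BOM bytes.
        tail -c +4 < "$1" > "$2"
    else
        # No BOM: copy the file unchanged.
        cat < "$1" > "$2"
    fi
}
```

For example, `strip_utf8_bom bar.txt clean.txt` would undo the command above.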

orbitalquark · 2021-07-12T13:03:57Z

Textadept is ignorant about BOMs. It silently reads and writes them as buffer contents with no special treatment. Textadept's editing component is Scintilla (https://scintilla.org), which does not show BOMs for the "View Whitespace" feature. Instead, you can put the following in your *~/.textadept/init.lua*: `view.representation['\xef\xbb\xbf'] = 'BOM'`. This will always display the BOM. You can also create a function that toggles the display of whitespace and BOMs and use that.

Sturwandan · 2021-07-13T00:59:28Z

Thank you for the suggestion. The line seems to kind of work; however, there were several problems with it.
They might be because I haven't updated Textadept since I installed it last year, so perhaps I'm running into bugs that are already fixed:

1. The BOM mark disappeared from TA when I switched to a different tab and back. I re-checked the file, and the BOM was still in there.
2. I decided to do a similar thing with CR, because it doesn't really work as a line terminator on UNIX systems. As far as I know, only ancient Macs used a lone CR as a line terminator, and I have never seen such a text file, so I thought I'd rather see CR as a black box, similar to other control characters, than as a line terminator:

view.representation['\x0d'] = 'CR'

This one didn't work at all.
3. When I use reload from the tab context menu, the [BOM] mark also disappears, while the file still has it.

Anyway, the reason I filed this as a bug report instead of searching the documentation is that I believe it would be universally useful for users to have this feature available by default.

In general, I wanted the "view whitespace" feature to work with all kinds of whitespace characters, not just tabs and spaces.

orbitalquark · 2021-07-13T16:16:30Z

1. Okay, it looks like `view.representation` settings do not persist when switching between buffers. I'm not sure if this is a Scintilla bug or not. You can work around it by applying the setting in an `events.BUFFER_AFTER_SWITCH` event handler.
2. There is a separate setting for viewing EOL characters like CR: `view.view_eol = true`
3. The BOM remains for me on buffer reload, but I am on the latest beta.

Textadept's buffer and view API is a thin wrapper around Scintilla's features. If Scintilla doesn't consider BOMs as whitespace in its "View Whitespace" setting, I'm not sure I want to argue with that.

Sturwandan · 2021-07-13T16:56:45Z

> There is a separate setting for viewing EOL characters like CR: `view.view_eol = true`

This setting would show both CR and LF, but my idea was to use LF as the line terminator while viewing CR as a regular character, similar to what less does. If LF were the one and only thing that terminates a line (so that a CR in the middle of a line wouldn't break it), then regular, well-formed ASCII/UTF-8 files would look fine, but files with a BOM or CR-LF line endings would be instantly noticeable.
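
The kind of view described here can be illustrated outside the editor. POSIX `sed -n l` prints each line in an unambiguous form in which LF is the only terminator and a CR shows up as a visible escape. This is only a sketch of the idea, not something Textadept or Scintilla provides, and the function name `show_cr` is made up.

```shell
# show_cr FILE (hypothetical helper name)
# Print FILE with LF as the only line terminator: each line ends with a
# literal "$", and any CR is rendered as the two characters "\r".
# Note that sed folds long lines, marking each fold with a trailing backslash.
show_cr() {
    sed -n l < "$1"
}
```

For a file containing `a`, CR, LF, `b`, LF, this prints `a\r$` followed by `b$`, so the CR-LF line stands out while the plain LF line looks normal.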

orbitalquark · 2021-07-13T17:04:14Z

I understand, but it doesn't look like Scintilla supports this. It seems to break on either CR or LF, irrespective of the actual EOL mode.

Sturwandan · 2021-07-14T01:20:10Z

OK, this code in my init.lua seems to produce the desired effect. I'll need to test it a bit more to make sure it has no other pitfalls, though:

-- Make Textadept display the BOM with a visible mark
view.representation['\xef\xbb\xbf'] = 'BOM'
-- Re-apply the setting after every buffer switch, since it does not persist
events.connect(events.BUFFER_AFTER_SWITCH, function()
	view.representation['\xef\xbb\xbf'] = 'BOM'
end)

Also, after some thinking, I'd change my proposal regarding line terminators: make them show up if they differ from the intended ones. For example, in LF mode, CRs show up as black boxes, including when they appear at the end of a line, while in CRLF mode, lone CRs, lone LFs, and LF CR sequences (except within CR LF CR LF) show up as black boxes, and CRLF terminators remain invisible and manifest themselves only as line breaks.
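
The classification this proposal relies on can be sketched from the command line. The following counts CRLF terminators, lone LFs, and stray CRs in a file. It only illustrates the rule and is not an editor feature; the function name `eol_report` is hypothetical, and the sketch assumes the file ends with a newline.

```shell
# eol_report FILE (hypothetical helper name)
# Print how many CRLF terminators, lone LFs, and stray CRs FILE contains.
# Assumes FILE ends with a newline; a final unterminated line ending in CR
# would be miscounted as a CRLF.
eol_report() {
    cr=$(printf '\r')
    total_cr=$(tr -cd '\r' < "$1" | wc -c)   # every CR byte in the file
    total_lf=$(tr -cd '\n' < "$1" | wc -c)   # every LF byte in the file
    # Lines whose last character before the LF is a CR, i.e. CRLF terminators.
    # grep -c exits non-zero when the count is 0, hence the "|| true".
    crlf=$(grep -c "${cr}\$" < "$1" || true)
    echo "crlf=$((crlf)) lone_lf=$((total_lf - crlf)) stray_cr=$((total_cr - crlf))"
}
```

In LF mode, the CRs counted under `crlf` and `stray_cr` would be the bogus characters to highlight; in CRLF mode, it would be the ones counted under `lone_lf` and `stray_cr`.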

Next idea: make whitespace at the ends of lines always show up, while hiding whitespace in the middle. Do you think I need to open a new bug report for that? It's what, say, KWrite does, and it's convenient because there is no need to switch modes to catch bogus whitespace before line breaks, and it doesn't sacrifice aesthetics.
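
Until an editor shows it, trailing whitespace of this kind can at least be listed from the command line. A minimal sketch, with a made-up function name `trailing_ws_lines`:

```shell
# trailing_ws_lines FILE (hypothetical helper name)
# Print "LINE_NUMBER:LINE" for every line that ends in a space or a tab.
trailing_ws_lines() {
    # [[:blank:]] matches space and tab only, so a CR before the LF is not reported.
    # grep exits non-zero when nothing matches, hence the "|| true".
    grep -n '[[:blank:]]$' < "$1" || true
}
```

Whitespace in the middle of a line is ignored, which mirrors the "show only at the ends" behavior described above.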

It would also be a good idea to make whitespace at the beginning of a line show up if it differs from the current indentation setting. For example, if I have indentation set to tabs (\HT) with a width of 4, and in a certain file most lines are indented with tabs, as they are supposed to be, but one line has spaces instead of tabs, then those spaces would be visible, and vice versa.
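
The tab-indentation case of this check can be sketched the same way. The function name `space_indented_lines` is hypothetical, and the sketch only covers files where tabs are the intended indentation.

```shell
# space_indented_lines FILE (hypothetical helper name)
# Print "LINE_NUMBER:LINE" for every line whose leading whitespace contains
# a space, which is the bogus case when tabs are the intended indentation.
space_indented_lines() {
    tab=$(printf '\t')
    # Zero or more tabs followed by a space at the start of the line.
    # grep exits non-zero when nothing matches, hence the "|| true".
    grep -n "^${tab}* " < "$1" || true
}
```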

My idea in the big picture: make the text in the editor correspond to its binary form on disk as closely as possible without sacrificing aesthetics and readability, which means I can't just accept black [LF] boxes sticking out at the end of every line, or black dots between every two words. But if, say, regular line terminators and whitespace were invisible while bogus ones were visible, this could be achieved.

orbitalquark · 2021-07-16T13:16:22Z

You can submit all of these ideas to the Scintilla bug tracker, as they would be more appropriate there: https://sourceforge.net/p/scintilla/feature-requests/. As I mentioned before, Textadept uses Scintilla as its editing component. If your suggestions made it into Scintilla, then any editor based on Scintilla would benefit from this.

Sturwandan · 2021-07-17T11:23:50Z

@orbitalquark I see...
Regarding line terminators, though, does the lack of a "CR only" mode come from Scintilla or Textadept?
I have read that the Mac used this line terminator; however, I don't know how it is handled in modern Mac OS X.

Do you think that Textadept should have a CR mode or not? Personally, I have never seen such files in the wild.

orbitalquark · 2021-07-17T13:40:31Z

Ah, I must have misunderstood that bit. Sorry about that. CR-only mode was removed as a visible menu option long ago due to extinction. You can manually enable it by setting `buffer.eol_mode = buffer.EOL_CR` via command entry, key binding, etc.

Sturwandan · 2021-07-17T14:50:35Z

Not quite, I hadn't asked this question before. Anyway, my proposal would be: show line breaks that are inconsistent with the current settings as black boxes.

orbitalquark · 2022-03-17T17:12:38Z

I'm going to close this because my understanding is the requested features/enhancements belong in Scintilla, not Textadept. If they were in Scintilla, Textadept would get them. Please correct me if I'm wrong.

orbitalquark closed this as completed Mar 17, 2022
