Unicode BOM (U+FEFF) is not displayed in view whitespaces mode #120
Comments
Textadept is ignorant about BOMs. It silently reads and writes them as buffer contents with no special treatment. Textadept's editing component is Scintilla (https://scintilla.org), which does not show BOMs for the "View Whitespace" feature. Instead, you can put the following in your *~/.textadept/init.lua*:
view.representation['\xef\xbb\xbf'] = 'BOM'
This will always display the BOM. You can also create a function that toggles the display of whitespace and BOMs and use that.
|
Thank you for the suggestion. The line more or less works, but there were several problems with it.
This one didn't work at all. Anyway, the reason I filed this as a bug report instead of searching the documentation is that I believe having this feature available by default would be universally useful to users. In general, I wanted the "view whitespace" feature to work with all kinds of whitespace characters, not just tabs and spaces. |
1. Okay, it looks like `view.representation` settings do not persist when switching between buffers. I'm not sure if this is a Scintilla bug or not. You can work around this by applying the setting in an `events.BUFFER_AFTER_SWITCH` event handler.
2. There is a separate setting for viewing EOL characters like CR: `view.view_eol = true`
3. BOM remains for me on buffer reload, but I am on the latest beta.
Textadept's buffer and view API is a thin wrapper around Scintilla's features. If Scintilla doesn't consider BOMs as whitespace in its "View Whitespace" setting, I'm not sure I want to argue with that.
|
This setting would show both CR and LF, but my idea was to use LF as line terminator, while viewing CR as a regular character, similar to what |
I understand, but it doesn't look like Scintilla supports this. It seems to break on either CR or LF, irrespective of the actual EOL mode.
|
OK, this code in my init.lua seems to produce the desired effect. I'll need to test it a bit more to make sure it has no other pitfalls, though:
Also, after some thought, I'd change my proposal regarding line terminators: make them show up only if they differ from the intended ones. For example, in LF mode, CRs show up as black boxes, including when they appear at the end of a line. In CRLF mode, lone CR, lone LF, and LF CR sequences (except within CR LF CR LF) show up as black boxes, while CRLF terminators remain invisible and manifest themselves only as line breaks.
Next idea: make whitespace at the ends of lines always show up, while hiding whitespace in the middle. Do you think I need to open a new bug report for that? It's what, say, Kwrite does, and it's convenient because there is no need to switch modes to catch bogus whitespace before line breaks, and it doesn't sacrifice aesthetics. It would also be a good idea to make whitespace at the beginning of a line show up if it differs from the current indentation setting. For example, if I have 4-space-wide indentation with \HT and most lines in a file are indented with tabs, as they are supposed to be, but one line has spaces instead of tabs, then those spaces are visible, and vice versa.
My idea in the big picture: make the text in the editor correspond to its binary form on disk as closely as possible without sacrificing aesthetics and readability. That means I can't just accept black [LF] boxes sticking out at the end of every line, or black dots between every two words. But if regular line terminators and whitespace were invisible while bogus ones were visible, this could be achieved. |
You can submit all of these ideas to the Scintilla bug tracker, as they would be more appropriate there: https://sourceforge.net/p/scintilla/feature-requests/. As I mentioned before, Textadept uses Scintilla as its editing component. If your suggestions made it into Scintilla, then any editor based on Scintilla would benefit from this.
|
@orbitalquark I see... Do you think that Textadept should have a CR mode or not? Personally, I have never seen such files in the wild. |
Ah, I must have misunderstood that bit. Sorry about that. CR-only mode was removed as a visible menu option long ago due to extinction. You can manually enable it by setting `buffer.eol_mode = buffer.EOL_CR` via command entry, key binding, etc.
|
Not quite, I didn't ask this question before. Anyway, so my proposal would be: show line breaks inconsistent with current settings as black boxes. |
I'm going to close this because my understanding is the requested features/enhancements belong in Scintilla, not Textadept. If they were in Scintilla, Textadept would get them. Please correct me if I'm wrong.
|
The Byte-Order-Mark is a pseudo-character used by some applications to detect the byte order of UTF-16 and UTF-32 encodings. Unlike those two, UTF-8 has no byte-order problems whatsoever. However, some text editors and IDEs erroneously insert it into UTF-8 files anyway, breaking the main feature we love UTF-8 for: full ASCII compatibility.
Nevertheless, this made some other programs adopt an incorrect usage of the UTF-8 BOM as well, namely to tell UTF-8 apart from obsolete 8-bit encodings. That is also wrong, because UTF-8 has a regular structure that is easy to detect. Meanwhile, most other programs handle valid UTF-8 fine but fail on BOM-prefixed UTF-8 files produced by the former programs, which I'd say is not a bug.
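To illustrate the point about UTF-8's regular structure: a file can be checked for well-formed UTF-8 without relying on a BOM at all. This is only a minimal sketch, assuming an `iconv` binary is installed; the function name `is_valid_utf8` is a hypothetical helper and has nothing to do with Textadept itself.

```shell
# Succeed if the file is well-formed UTF-8; no BOM is needed for this check.
# Assumption: an `iconv` binary is available and exits non-zero on invalid input.
is_valid_utf8() {
  iconv -f UTF-8 -t UTF-8 < "$1" > /dev/null 2>&1
}
```

For example, `is_valid_utf8 foo.txt && echo "looks like UTF-8"`. Note that pure ASCII files pass this check too, since ASCII is a subset of UTF-8.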
The former meaning of this character was "zero width no-break space", though Unicode now has a different dedicated code point for that purpose (U+2060 WORD JOINER), reserving U+FEFF for use as a BOM only.
Other text editors deal with the BOM in different ways: some strip it upon saving, some add it, some don't touch it, and some have settings that control the behavior. From my testing, Textadept takes an "I don't care" approach, meaning the BOM is treated as an arbitrary character and (correctly?) displayed as a zero-width character, which manifests as follows:
The text looks the same as it does without a BOM. However, if you put the cursor at the beginning of the first line and hit the DEL key, the first visible character is unaffected, but the BOM is stripped, and after saving, the file no longer contains it. This behavior is not unexpected in regular mode. However, after selecting the menu option "Toggle View Whitespace", the BOM is still not visible, unlike tabs and regular spaces.
While this behavior is acceptable for regular mode, I think that Textadept should show the BOM, for example as [BOM] (inverted text) similar to [SOH] and [NUL], at the very least in View Whitespace mode, because the user has to be able to deal with it from the text editor itself.
But, I think that it should be displayed even in regular mode by default, to make dealing with this sort of problem easier, so you can see that your UTF-8 file has BOM as soon as you open it, but that's at your discretion.
{ echo -en '\xef\xbb\xbf'; cat foo.txt; } > bar.txt
The command above can be used to add an erroneous UTF-8 BOM to foo.txt and write the result to bar.txt. The BOM can then be stripped from the file in Textadept using the method described above. My only suggestion is to make it easier to strip the BOM (and, in the rare cases when you have to deal with buggy software, to add it) by displaying it.
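For completeness, the reverse operation can also be done from the command line. The following is a rough sketch, assuming a shell with `head -c`, `od`, `tr` and `tail` available; `has_utf8_bom` and `strip_utf8_bom` are hypothetical helper names, not existing tools.

```shell
# Succeed if the file starts with the UTF-8 BOM bytes EF BB BF.
has_utf8_bom() {
  [ "$(head -c 3 < "$1" | od -An -tx1 | tr -d '[:space:]')" = "efbbbf" ]
}

# Write the file's contents to stdout, minus a leading UTF-8 BOM if present.
strip_utf8_bom() {
  if has_utf8_bom "$1"; then
    tail -c +4 < "$1"   # start output at byte 4, skipping the 3 BOM bytes
  else
    cat < "$1"
  fi
}
```

With these defined, `strip_utf8_bom bar.txt > clean.txt` should produce a file identical to the original foo.txt.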