Panic on non-UTF-8 files #18

vv9k · 2021-06-01T18:41:06Z

For example, trying to open the hx binary:

```
❯ hx target/debug/hx
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: stream did not contain valid UTF-8', helix-term/src/main.rs:117:46
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
```

I don't know if there is a plan to support other text encodings, but it would be nice to at least be able to view a raw binary file as a fallback. The current implementation of Document uses Rope internally, and from what I've seen it accepts only valid UTF-8, so I'm not sure how this could be handled here.
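As a rough illustration of the fallback idea, here is a minimal sketch that classifies a file's bytes instead of unwrapping the UTF-8 conversion. It assumes nothing about Helix's actual loading code: `Loaded`, `classify`, and `load` are hypothetical names, and only the standard library is used.

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Outcome of trying to load a file as text (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum Loaded {
    /// The bytes were valid UTF-8.
    Text(String),
    /// The bytes were not valid UTF-8; `valid_up_to` is the byte offset
    /// of the first invalid sequence.
    NotUtf8 { valid_up_to: usize },
}

/// Classify raw bytes instead of calling `unwrap()` on the conversion.
pub fn classify(bytes: Vec<u8>) -> Loaded {
    match String::from_utf8(bytes) {
        Ok(text) => Loaded::Text(text),
        Err(err) => Loaded::NotUtf8 {
            valid_up_to: err.utf8_error().valid_up_to(),
        },
    }
}

/// Read a whole file and classify it; I/O errors are returned to the caller.
pub fn load(path: &Path) -> io::Result<Loaded> {
    Ok(classify(fs::read(path)?))
}
```

With something like this, the caller can decide what to do with `NotUtf8` (show a message, fall back to another view, and so on) rather than panicking.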


cessen · 2021-06-01T20:07:46Z

Some thoughts below, from a semi-random passer-by:

For other text encodings: Ropey (the rope library Helix is using) is designed for streaming transcoding on both load and save. So using it in combination with something like the encoding_rs crate to do the transcoding, and some reasonable auto-detection of text encodings, would handle most valid text files you might encounter. (It won't handle all esoteric corner-cases, of course, because the text encoding landscape is... complicated. But it will at least handle most common encodings in a reasonable way.)
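To make "transcoding on load and save" concrete, here is a small standard-library sketch for a single encoding, UTF-16LE. It is a hand-rolled stand-in rather than the encoding_rs API, it decodes a whole buffer at once instead of streaming, and the function names are hypothetical.

```rust
/// Transcode UTF-16LE bytes to a UTF-8 `String` (on load).
/// Unpaired surrogates, and a trailing odd byte, become U+FFFD.
pub fn utf16le_to_string(bytes: &[u8]) -> String {
    let pairs = bytes.chunks_exact(2);
    let has_trailing_byte = !pairs.remainder().is_empty();
    let units = pairs.map(|p| u16::from_le_bytes([p[0], p[1]]));
    let mut text: String = char::decode_utf16(units)
        .map(|r| r.unwrap_or(char::REPLACEMENT_CHARACTER))
        .collect();
    if has_trailing_byte {
        text.push(char::REPLACEMENT_CHARACTER);
    }
    text
}

/// Transcode UTF-8 text back to UTF-16LE bytes (on save).
pub fn string_to_utf16le(text: &str) -> Vec<u8> {
    text.encode_utf16().flat_map(|unit| unit.to_le_bytes()).collect()
}
```

The decoded `String` (or chunks of it) is what would then be handed to the rope.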

For binary files: there's not really an obvious "right" thing for a text editor to do, so it's more about just picking an option and going with it. Some possibilities include:

- Include a hex-editor mode. This is what e.g. Sublime Text does.
- Transcode it as latin-1, which round-trips losslessly for any file but displays the data as garbled text.
- Refuse to open the file, with a message explaining that it's either binary or an unsupported text encoding. (And possibly suggest opening it in a hex editor.)
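The latin-1 option works because every byte value maps to exactly one code point (byte N becomes U+00NN). A minimal sketch of that round trip, using only the standard library and hypothetical function names:

```rust
/// Decode bytes as Latin-1 (ISO-8859-1): byte N becomes code point U+00NN.
/// This never fails, whatever the input bytes are.
pub fn latin1_decode(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

/// Encode text back to Latin-1 bytes.
/// Returns `None` if the text contains a character above U+00FF,
/// which can only happen if the user typed one in while editing.
pub fn latin1_encode(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| {
            let code = c as u32;
            if code <= 0xFF { Some(code as u8) } else { None }
        })
        .collect()
}
```

Decoding and then encoding an unmodified buffer gives back the original bytes, which is what makes this usable as a lossless fallback.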

L-as · 2021-06-01T20:16:19Z

Optimally it would show it as UTF-8, but show some placeholder for incorrect encodings like in kakoune. Lossless round-trips aren't necessary IMO since often you don't want to edit the file, and if you do, it's often because you want to remove the incorrectly encoded parts.
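A minimal sketch of the placeholder approach, using the standard library's lossy conversion, which substitutes U+FFFD for invalid sequences. Whether this matches the placeholder kakoune actually displays is not something this sketch claims; the function names are hypothetical.

```rust
use std::borrow::Cow;

/// View bytes as UTF-8, replacing each invalid sequence with U+FFFD.
/// Borrows the input when it is already valid UTF-8.
pub fn view_lossy(bytes: &[u8]) -> Cow<'_, str> {
    String::from_utf8_lossy(bytes)
}

/// True if saving the viewed text would reproduce the original bytes,
/// i.e. nothing had to be replaced.
pub fn is_lossless(bytes: &[u8]) -> bool {
    std::str::from_utf8(bytes).is_ok()
}
```

The trade-off is the one described above: if `is_lossless` is false, writing the buffer back out changes the file.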

archseer · 2021-06-02T00:59:16Z

(Hey @cessen 👋🏻 Thank you for building Ropey, it's great!)

So far I've made no effort to support anything other than UTF-8. I think moving forward we could use encoding_rs + hsivonen/chardetng for encoding detection, but I'd probably maintain a whitelist of encodings we'd allow. I've gotten hit by esoteric edge cases before where the encoding was under-specified.
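A minimal sketch of what such a whitelist check could look like, assuming the detector hands back an encoding label as a string. The list contents and the `allowed_encoding` name are purely illustrative, not a proposal for the actual set.

```rust
/// Hypothetical allowlist; the labels are illustrative only.
const ALLOWED_ENCODINGS: &[&str] = &[
    "utf-8",
    "utf-16le",
    "utf-16be",
    "windows-1252",
    "shift_jis",
];

/// Return the canonical allowlisted label for a detected encoding name,
/// or `None` if the encoding is not one we are willing to handle.
pub fn allowed_encoding(detected: &str) -> Option<&'static str> {
    let wanted = detected.trim();
    ALLOWED_ENCODINGS
        .iter()
        .copied()
        .find(|label| label.eq_ignore_ascii_case(wanted))
}
```

A `None` result would then lead to a fallback path (refusing to open, or a lossy view) instead of guessing.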

cessen · 2021-06-02T05:07:00Z

> (Hey @cessen 👋🏻 Thank you for building Ropey, it's great!)

You're very welcome! If you end up running into any problems with it, please don't hesitate to file an issue. :-)

> but I'd probably maintain a whitelist of encodings we'd allow.

Yeah, I think that's a very reasonable approach. Supporting a curated set of encodings that you're confident work correctly is better than trying to support everything unreliably.

kirawi · 2021-06-10T04:42:16Z

I'll take a stab at this.

pickfire · 2021-06-10T09:35:33Z

A discussion happened at cessen/ropey#40, where @cessen suggested that we use RopeBuilder if we intend to read files in different encodings. We may also want to do our own buffering if we go that way.

cessen · 2021-06-10T18:32:42Z

To expand on what @pickfire said: I think using RopeBuilder is just generally the better way to go for Helix, even aside from transcoding. It will let you robustly handle a lot of things that from_reader() won't. from_reader() is really just a convenience method for more casual use, quick prototyping, etc.

For example, another good use-case for RopeBuilder is loading very large files. from_reader() will just block until the whole file is loaded, which isn't great if it takes a while. But using RopeBuilder + handling IO yourself, you can do things like showing a progress bar and letting the user cancel if it's taking too long.

pickfire · 2021-06-11T02:07:07Z

> For example, another good use-case for RopeBuilder is loading very large files. from_reader() will just block until the whole file is loaded, which isn't great if it takes a while. But using RopeBuilder + handling IO yourself, you can do things like showing a progress bar and letting the user cancel if it's taking too long.

Even loading sqlite.c (5.2MB - 5385123 loc) is just taking roughly one second to load the file and start even in debug build. I wonder if we will get into cases where we need a progress bar for it. Anyone have any idea which file is larger to test?

kirawi · 2021-06-11T02:18:41Z

> For example, another good use-case for RopeBuilder is loading very large files. from_reader() will just block until the whole file is loaded, which isn't great if it takes a while. But using RopeBuilder + handling IO yourself, you can do things like showing a progress bar and letting the user cancel if it's taking too long.
>
> Even loading sqlite.c (5.2MB - 5385123 loc) is just taking roughly one second to load the file and start even in debug build. I wonder if we will get into cases where we need a progress bar for it. Anyone have any idea which file is larger to test?

Maybe a huge XML file?

archseer · 2021-06-11T02:33:14Z

Note: we're also discussing large file loading with regard to Led here: #219

cessen · 2021-06-11T02:38:29Z

Here are a couple of test files I use when testing my own editor:

- 100mb.txt.zip (download is ~450 KB, unzipped ~100 MB)
- 1gb.txt.zip (download is ~4.5 MB, unzipped ~1 GB)

Having said that, even the 1 GB file loads in ~1 second in my editor (which also uses Ropey). But I've heard that some people open 10-20GB (or even larger) log files sometimes, so that would start to become pretty significant wait times. And when transcoding on load because a file isn't utf8, those times will almost certainly go up quite a bit as well.

kirawi · 2021-06-11T15:16:14Z

Seems like we might want a wrapper over RopeBuilder for our use case? Or does that already exist with Document?

kirawi · 2021-06-11T17:51:06Z

> To expand on what @pickfire said: I think using RopeBuilder is just generally the better way to go for Helix, even aside from transcoding. It will let you robustly handle a lot of things that from_reader() won't. from_reader() is really just a convenience method for more casual use, quick prototyping, etc.
>
> For example, another good use-case for RopeBuilder is loading very large files. from_reader() will just block until the whole file is loaded, which isn't great if it takes a while. But using RopeBuilder + handling IO yourself, you can do things like showing a progress bar and letting the user cancel if it's taking too long.

This is completely new to me, so I have a few stupid questions to ask if you don't mind. How would you use RopeBuilder to open a file in a non-blocking way? From the example you gave in the other thread, it would block until the entire file was read, wouldn't it?

cessen · 2021-06-11T18:22:11Z

> How would you use RopeBuilder to open a file in a non-blocking way?

The short answer is simply that RopeBuilder doesn't handle IO, so you can do the IO yourself however you want (sync, async, hand-writing an event loop that periodically updates the UI and checks for input, or whatever).

For example, you could do something like this (pseudo-code):

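```rust
// Pseudo-code: `FancyBufReader`, `ui_update_progress_bar`, and
// `user_input_cancel` are placeholders, not real APIs.
let mut reader = FancyBufReader::new(my_file);
let mut builder = RopeBuilder::new();

// Read one valid `&str` chunk at a time so the UI stays responsive.
while let Ok(text_chunk) = reader.read_next_str_chunk_of_max_length(4096) {
    builder.append(text_chunk);
    ui_update_progress_bar();
    if user_input_cancel() {
        return None; // the user cancelled the load
    }
}

return Some(builder.finish());
```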

This does IO incrementally, a chunk at a time, and updates the UI and checks user input as it goes.

But RopeBuilder has nothing to do with IO at all, so you could write something similar to this but using async IO primitives or whatever instead. RopeBuilder doesn't care, it just takes str slices.
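One detail the pseudo-code glosses over is that a fixed-size read can end in the middle of a multi-byte UTF-8 character. Below is a runnable sketch of the same loop using only the standard library. The `TextSink` trait is a hypothetical stand-in for anything that takes `&str` chunks the way `RopeBuilder::append` does, and the `keep_going` callback stands in for the progress-bar and cancel checks.

```rust
use std::io::{self, Read};
use std::str;

/// Hypothetical stand-in for a rope builder: anything accepting `&str` chunks.
pub trait TextSink {
    fn append(&mut self, chunk: &str);
}

impl TextSink for String {
    fn append(&mut self, chunk: &str) {
        self.push_str(chunk);
    }
}

/// Feed `reader` to `sink` one chunk at a time.
/// `keep_going(bytes_so_far)` runs after every chunk; returning `false` cancels.
/// Returns `Ok(true)` when fully loaded and `Ok(false)` when cancelled.
pub fn load_incrementally<R: Read, S: TextSink>(
    mut reader: R,
    sink: &mut S,
    chunk_size: usize,
    mut keep_going: impl FnMut(usize) -> bool,
) -> io::Result<bool> {
    // At least 4 bytes, so one full UTF-8 character always fits.
    let mut buf = vec![0u8; chunk_size.max(4)];
    let mut pending = 0; // bytes of an incomplete character carried over
    let mut total = 0;
    loop {
        let n = match reader.read(&mut buf[pending..]) {
            Ok(n) => n,
            Err(e) if e.kind() == io::ErrorKind::Interrupted => continue,
            Err(e) => return Err(e),
        };
        if n == 0 {
            // End of input: leftover bytes mean a truncated character.
            if pending != 0 {
                return Err(io::Error::new(
                    io::ErrorKind::InvalidData,
                    "input ends in the middle of a UTF-8 character",
                ));
            }
            return Ok(true);
        }
        let filled = pending + n;
        let valid = match str::from_utf8(&buf[..filled]) {
            Ok(_) => filled,
            // `error_len() == None` means the chunk merely ends mid-character.
            Err(e) if e.error_len().is_none() => e.valid_up_to(),
            Err(e) => return Err(io::Error::new(io::ErrorKind::InvalidData, e)),
        };
        let text = str::from_utf8(&buf[..valid]).expect("validated above");
        sink.append(text);
        total += valid;
        // Move the incomplete tail to the front for the next read.
        buf.copy_within(valid..filled, 0);
        pending = filled - valid;
        if !keep_going(total) {
            return Ok(false);
        }
    }
}
```

This version is synchronous and rejects invalid UTF-8 outright; an async variant or a transcoding step would slot into the same structure.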

> From the example you gave in the other thread, it would block until the entire file was read, wouldn't it?

I'm not sure exactly which example you're referring to, but if it calls from_reader(), then yes. from_reader() always blocks until the entire file is read (errors notwithstanding).

Hope that helps!

vv9k changed the title ~~Panic on non-utf8 files~~ Panic on non-UTF-8 files Jun 1, 2021

rikusalminen mentioned this issue Jun 3, 2021

Panics on smaller terminal sizes #74

Closed

archseer assigned kirawi Jun 10, 2021

This was referenced Jun 11, 2021

Inspiration from some less-known editors #219

Closed

Handle non-UTF8 files #228

Merged

archseer closed this as completed in #228 Jun 23, 2021
