Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Panic on non-UTF-8 files #18

Closed
vv9k opened this issue Jun 1, 2021 · 14 comments · Fixed by #228
Closed

Panic on non-UTF-8 files #18

vv9k opened this issue Jun 1, 2021 · 14 comments · Fixed by #228
Assignees

Comments

@vv9k
Copy link
Contributor

vv9k commented Jun 1, 2021

For example trying to open the hx binary:

❯ hx target/debug/hx
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: stream did not contain valid UTF-8', helix-term/src/main.rs:117:46
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

I don't know if there is a plan to support other text encodings but it sure would be nice to just be able to view a raw binary file as a fallback. The current implementation of Document uses Rope internally, from what I've seen it accepts only valid UTF-8 so not really sure how it could be handled here.

@vv9k vv9k changed the title Panic on non-utf8 files Panic on non-UTF-8 files Jun 1, 2021
@cessen
Copy link
Contributor

cessen commented Jun 1, 2021

Some thoughts below, from a semi-random passer-by:

For other text encodings: Ropey (the rope library Helix is using) is designed for streaming transcoding on both load and save. So using it in combination with something like the encoding_rs crate to do the transcoding, and some reasonable auto-detection of text encodings, would handle most valid text files you might encounter. (It won't handle all esoteric corner-cases, of course, because the text encoding landscape is... complicated. But it will at least handle most common encodings in a reasonable way.)

For binary files: there's not really an obvious "right" thing for a text editor to do, so it's more about just picking an option and going with it. Some possibilities include:

  • Including a hex-editor mode. This is what e.g. Sublime Text does.
  • Transcode it as latin-1, which round-trips losslessly for any file but displays the data as garbled text.
  • Refuse to open the file, with a message explaining that it's either binary or an unsupported text encoding. (And possibly suggest opening it in a hex editor.)

@L-as
Copy link

L-as commented Jun 1, 2021

Optimally it would show it as UTF-8, but show some placeholder for incorrect encodings like in kakoune. Lossless round-trips aren't necessary IMO since often you don't want to edit the file, and if you do, it's often because you want to remove the incorrectly encoded parts.

@archseer
Copy link
Member

archseer commented Jun 2, 2021

(Hey @cessen 👋🏻 Thank you for building Ropey, it's great!)

So far I've made no effort to support anything other than UTF-8. I think moving forward we could use encoding_rs + hsivonen/chardetng for encoding detection, but I'd probably maintain a whitelist of encodings we'd allow. I've gotten hit by esoteric edge cases before where the encoding was under-specified.

@cessen
Copy link
Contributor

cessen commented Jun 2, 2021

(Hey @cessen 👋🏻 Thank you for building Ropey, it's great!)

You're very welcome! If you end up running into any problems with it, please don't hesitate to file an issue. :-)

but I'd probably maintain a whitelist of encodings we'd allow.

Yeah, I think that's a very reasonable approach. Supporting a curated set of encodings that you're confident work correctly is better than trying to support everything unreliably.

@kirawi
Copy link
Member

kirawi commented Jun 10, 2021

I'll take a stab at this.

@pickfire
Copy link
Contributor

pickfire commented Jun 10, 2021

A discussion happened at cessen/ropey#40, @cessen suggested that we use RopeBuilder if we intend to read from different file encoding, we may also want to do our own buffering if we do it that way.

@cessen
Copy link
Contributor

cessen commented Jun 10, 2021

To expand on what @pickfire said: I think using RopeBuilder is just generally the better way to go for Helix, even aside from transcoding. It will let you robustly handle a lot of things that from_reader() won't. from_reader() is really just a convenience method for more casual use, quick prototyping, etc.

For example, another good use-case for RopeBuilder is loading very large files. from_reader() will just block until the whole file is loaded, which isn't great if it takes a while. But using RopeBuilder + handling IO yourself, you can do things like showing a progress bar and letting the user cancel if it's taking too long.

@pickfire
Copy link
Contributor

For example, another good use-case for RopeBuilder is loading very large files. from_reader() will just block until the whole file is loaded, which isn't great if it takes a while. But using RopeBuilder + handling IO yourself, you can do things like showing a progress bar and letting the user cancel if it's taking too long.

Even loading sqlite.c (5.2MB - 5385123 loc) is just taking roughly one second to load the file and start even in debug build. I wonder if we will get into cases where we need a progress bar for it. Anyone have any idea which file ls larger to test?

@kirawi
Copy link
Member

kirawi commented Jun 11, 2021

For example, another good use-case for RopeBuilder is loading very large files. from_reader() will just block until the whole file is loaded, which isn't great if it takes a while. But using RopeBuilder + handling IO yourself, you can do things like showing a progress bar and letting the user cancel if it's taking too long.

Even loading sqlite.c (5.2MB - 5385123 loc) is just taking roughly one second to load the file and start even in debug build. I wonder if we will get into cases where we need a progress bar for it. Anyone have any idea which file ls larger to test?

Maybe a huge XML file?

@archseer
Copy link
Member

Note: we're also discussing large file loading with regards to Led here: #219

@cessen
Copy link
Contributor

cessen commented Jun 11, 2021

Here are a couple of test files I use when testing my own editor:

100mb.txt.zip (download is ~450 KB, unzipped ~100 MB)
1gb.txt.zip (download is ~4.5 MB, unzipped ~1 GB)

Having said that, even the 1 GB file loads in ~1 second in my editor (which also uses Ropey). But I've heard that some people open 10-20GB (or even larger) log files sometimes, so that would start to become pretty significant wait times. And when transcoding on load because a file isn't utf8, those times will almost certainly go up quite a bit as well.

@kirawi
Copy link
Member

kirawi commented Jun 11, 2021

Seems like we might want a wrapper over RopeBuilder for our usecase? Or does that already exist with Document?

@kirawi
Copy link
Member

kirawi commented Jun 11, 2021

To expand on what @pickfire said: I think using RopeBuilder is just generally the better way to go for Helix, even aside from transcoding. It will let you robustly handle a lot of things that from_reader() won't. from_reader() is really just a convenience method for more casual use, quick prototyping, etc.

For example, another good use-case for RopeBuilder is loading very large files. from_reader() will just block until the whole file is loaded, which isn't great if it takes a while. But using RopeBuilder + handling IO yourself, you can do things like showing a progress bar and letting the user cancel if it's taking too long.

This is completely new to me, so I have a few stupid questions to ask if you don't mind. How would you use RopeBuilder to open a file in a non-blocking way? From the example you gave in the other thread, it would block until the entire file was read, wouldn't it?

@cessen
Copy link
Contributor

cessen commented Jun 11, 2021

How would you use RopeBuilder to open a file in a non-blocking way?

The short answer is simply that RopeBuilder doesn't handle IO, so you can do the IO yourself however you want (sync, async, hand-writing an event loop that periodically updates the UI and and checks for input, or whatever).

For example, you could do something like this (pseudo-code):

let mut reader = FancyBufReader::new(my_file);
let mut builder = RopeBuilder::new();

while let Ok(text_chunk) = reader.read_next_str_chunk_of_max_length(4096) {
    builder.append(text_chunk);
    ui_update_progress_bar();
    if user_input_cancel() {
        return None;
    }
}

return Some(builder.finish());

This does IO incrementally, a chunk at a time, and updates the UI and checks user input as it goes.

But RopeBuilder has nothing to do with IO at all, so you could write something similar to this but using async IO primitives or whatever instead. RopeBuilder doesn't care, it just takes str slices.

From the example you gave in the other thread, it would block until the entire file was read, wouldn't it?

I'm not sure exactly which example you're referring to, but if it calls from_reader(), then yes. from_reader() always blocks until the entire file is read (errors notwithstanding)

Hope that helps!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants