This repository has been archived by the owner on Jul 11, 2019. It is now read-only.

Buggy behaviour when working with accentuated characters #38

Closed
WyohKnott opened this issue Jan 17, 2017 · 4 comments
Comments

@WyohKnott

When working with filenames containing accented or other non-ASCII characters, the {} placeholder is not replaced correctly.

For example:
seq 1 10 | parallel echo "Québec-q{}.webm"

gives:

Québec-1}.webm
Québec-2}.webm
Québec-3}.webm
Québec-4}.webm
Québec-5}.webm
Québec-6}.webm
Québec-7}.webm
Québec-8}.webm
Québec-9}.webm
Québec-10}.webm

instead of Québec-q1.webm and so on.

If there are more non-ASCII characters, the program panics:

seq 1 10 | RUST_BACKTRACE=1 parallel echo "Œuf_échaudé-q{}.webm"

gives

parallel: reading inputs from standard input
thread 'main' panicked at 'byte index 18 is not a char boundary; it is inside 'é' (bytes 17..19) of `echo Œuf_échaudé-q{}.webm`', /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libcore/str/mod.rs:1771
stack backtrace:
   1:     0x560b9875899a - std::sys::imp::backtrace::tracing::imp::write::h9c41d2f69e5caabf
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/sys/unix/backtrace/tracing/gcc_s.rs:42
   2:     0x560b98757cee - std::panicking::default_hook::{{closure}}::hcc803c8663cda123
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:351
   3:     0x560b98756fdb - std::panicking::rust_panic_with_hook::hffbc74969c7b5d87
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:367
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:555
   4:     0x560b98756b3f - std::panicking::begin_panic::hc4c5d184a1e3fb7c
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:517
   5:     0x560b98756ac9 - std::panicking::begin_panic_fmt::h34f5b320b0f94559
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:501
   6:     0x560b98765956 - core::panicking::panic_fmt::h1016b85b51d1931f
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libstd/panicking.rs:477
   7:     0x560b98766e4f - core::str::slice_error_fail::h02b27cb27b0f1c1d
                        at /buildslave/rust-buildbot/slave/nightly-dist-rustc-linux/build/src/libcore/str/mod.rs:1771
   8:     0x560b9875069f - parallel::main::h6c96215d2b4b63a7
   9:     0x560b9875334f - main
  10:     0x7f1a66c82400 - __libc_start_main
  11:     0x560b9871d649 - _start
  12:                0x0 - <unknown>
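
The panic message above is what Rust produces when a &str is sliced at a byte offset that falls inside a multi-byte character. Below is a minimal sketch of checking such an offset before slicing, using the standard `str::get` method. The helper name `safe_prefix` is hypothetical and is not part of parallel.

```rust
/// Returns the prefix of `s` up to byte offset `end`, or `None` if `end`
/// is out of range or falls inside a multi-byte UTF-8 sequence.
/// (Hypothetical helper, for illustration only.)
fn safe_prefix(s: &str, end: usize) -> Option<&str> {
    // `str::get` performs the same boundary check as `&s[..end]`,
    // but returns `None` instead of panicking.
    s.get(..end)
}
```

For the command in the report, `safe_prefix("echo Œuf_échaudé-q{}.webm", 18)` returns `None`, because byte 18 is the second byte of the final 'é' (bytes 17..19), while an offset of 17 returns `Some("echo Œuf_échaud")`.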
@WyohKnott
Author

The "shifting" seems directly correlated to the number of bytes that "special" characters occupy in UTF-8.

For example, the character 💖 is encoded as 4 bytes in UTF-8, so the {} placeholder is shifted 3 positions to the left:

seq 1 10 | parallel echo "test_💖_-q{}.webm"
test_💖1q{}.webm
test_💖2q{}.webm
test_💖3q{}.webm
test_💖4q{}.webm
test_💖5q{}.webm
test_💖6q{}.webm
test_💖7q{}.webm
test_💖8q{}.webm
test_💖9q{}.webm
test_💖10q{}.webm

Somewhere in your code there must be an assumption that 1 character = 1 byte, and it messes everything up for characters encoded with more than 1 byte.
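
The shift described above can be reproduced with a small, deliberately buggy sketch that finds {} by character position and then uses that position as a byte offset. This only illustrates the suspected class of bug and is not the actual code from parallel; the function name `buggy_substitute` is hypothetical.

```rust
/// Deliberately buggy: locates the first `{}` by *character* index, then
/// slices the template with that index as if it were a *byte* offset.
/// (Hypothetical function, for illustration only.)
fn buggy_substitute(template: &str, input: &str) -> String {
    let chars: Vec<char> = template.chars().collect();
    // Character index of the first "{}" placeholder, if any.
    let pos = chars.windows(2).position(|w| w[0] == '{' && w[1] == '}');
    match pos {
        // BUG: `i` counts characters, but string slicing expects bytes.
        // Every multi-byte character before the placeholder moves both
        // cut points to the left by (its byte length - 1).
        Some(i) => format!("{}{}{}", &template[..i], input, &template[i + 2..]),
        None => template.to_string(),
    }
}
```

With this sketch, `buggy_substitute("Québec-q{}.webm", "1")` yields `Québec-1}.webm` and `buggy_substitute("test_💖_-q{}.webm", "1")` yields `test_💖1q{}.webm`, matching the outputs reported above. A template such as `Œuf_échaudé-q{}.webm` panics instead, because the miscounted offset lands inside 'é'.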

@mmstick
Owner

mmstick commented Jan 17, 2017

The issue is in the tokenizer. This is the stage that finds placeholders like {} in the command template and converts the template into its corresponding series of tokens.

@mmstick
Owner

mmstick commented Jan 18, 2017

I'll have the fix uploaded soon. The fix is just manually incrementing the index value by the character's actual size using the len_utf8() method. Example:

let mut id = 0; // byte offset of the current character within `data`
for character in data.chars() {
    // actual work
    id += character.len_utf8();
}
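
To show how that byte-offset bookkeeping fits into a substitution loop, here is a minimal sketch. It only handles the plain {} placeholder and is not the code from the actual commit; the function name `substitute` and its signature are assumptions made for illustration.

```rust
/// Replaces every `{}` in `template` with `input`, tracking the current
/// position as a byte offset via `char::len_utf8`.
/// (Hypothetical function; the real tokenizer handles more token types.)
fn substitute(template: &str, input: &str) -> String {
    let mut output = String::with_capacity(template.len() + input.len());
    let mut start = 0; // byte offset where the pending literal text begins
    let mut id = 0; // byte offset of the current character
    let mut chars = template.chars().peekable();
    while let Some(character) = chars.next() {
        if character == '{' && chars.peek() == Some(&'}') {
            // Flush the literal text before the placeholder, then the input.
            output.push_str(&template[start..id]);
            output.push_str(input);
            chars.next(); // consume the closing '}'
            id += 2; // '{' and '}' are one byte each in UTF-8
            start = id;
        } else {
            // Advance by the character's encoded size, not by 1.
            id += character.len_utf8();
        }
    }
    // Flush whatever literal text follows the last placeholder.
    output.push_str(&template[start..]);
    output
}
```

Under these assumptions, `substitute("Québec-q{}.webm", "1")` returns `Québec-q1.webm`. The standard `str::char_indices` iterator yields the same byte offsets without a manual counter, which may be a tidier choice when no other state needs tracking.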

@mmstick
Owner

mmstick commented Jan 18, 2017

7053e3a

@mmstick mmstick closed this as completed Jan 18, 2017