Added support for custom width providers #128

Moxinilian · 2018-04-01T14:28:06Z

This is part of the ongoing effort to support #126
This PR offers a way to set custom (integer) widths for characters. It does not add support for floating point numbers yet.
This allows the user to pass to most functions a closure taking a char and returning its width as a usize.
I tried to limit API changes for simple tasks by leaving the fill and wrap functions intact in signature, and added wrap_dynamic and fill_dynamic that take the closure.
I was not really inspired with those names, if you want me to change them to something else I would be glad to do it.
I also chose to re-export unicode_width::UnicodeWidthChar as it will probably come in handy for people writing custom closures, avoiding them to depend on unicode_width.

Example:

let wrap = Wrapper::new(15);
let res = wrap.wrap_dynamic("Concurrency without data races.", |c| {
    if c == 'h' {
        c.width().unwrap_or(0) + 4
    } else {
        c.width().unwrap_or(0)
    }
});
assert_eq!(res, vec!["Concurrency", "without", "data races."]);

This example illustrates an hypothetical case where an h would actually take 5 columns instead of 1.
I included it as a test, with its equivalent for Wrapper::fill_dynamic.

Benchmark:

before
test fill_100 ... bench:         539 ns/iter (+/- 115)
test fill_200 ... bench:       1,188 ns/iter (+/- 141)
test fill_400 ... bench:       2,416 ns/iter (+/- 329)
test fill_800 ... bench:       5,349 ns/iter (+/- 974)
test wrap_100 ... bench:         639 ns/iter (+/- 111)
test wrap_200 ... bench:       1,397 ns/iter (+/- 206)
test wrap_400 ... bench:       2,729 ns/iter (+/- 405)
test wrap_800 ... bench:       5,575 ns/iter (+/- 664)

after
test fill_100 ... bench:         554 ns/iter (+/- 139)
test fill_200 ... bench:       1,199 ns/iter (+/- 145)
test fill_400 ... bench:       2,409 ns/iter (+/- 1,570)
test fill_800 ... bench:       5,142 ns/iter (+/- 658)
test wrap_100 ... bench:         632 ns/iter (+/- 113)
test wrap_200 ... bench:       1,430 ns/iter (+/- 221)
test wrap_400 ... bench:       2,732 ns/iter (+/- 323)
test wrap_800 ... bench:       5,417 ns/iter (+/- 456)

Zero-cost abstraction really is awesome.

Moxinilian · 2018-04-01T16:23:57Z

By the way this crate definitely deserves a refactor, it's getting quite big and it's a bit hard to work on it when it's only one single big file full of documentation.

mgeisler · 2018-04-06T16:44:09Z

Hey @Moxinilian, thanks for the PR. It's not been forgotten, I've just been busy lately...

I'm traveling this weekend, so it won't be before next week that I get proper time to look at this.

Just some short remarks: I love that you could do this via a generic function parameter --- zero-cost abstractions are awesome, as you say :-)

About measuring the characters: would it perhaps be better to measure words at a time?

With combining characters, I guess you don't really know the true displayed width of a word until you typeset the entire thing. Take emoticons as an extreme example since I'm typing this on my phone: I recently learned that a symbol like 👩‍👩‍👧‍👦 is created at rendering time --- it's really multiple Unicode characters with zero-width-joiners between them, something like 👧👧👨👨 next to each other. See https://emojipedia.org/emoji-zwj-sequences/.

So if you measure the width of this char by char, then you get the wrong result. I think other combining characters will have the same problem.

Finally, you're definitely right that the crate should be refactored by now... It was my first larger piece of Rust code, so I'm happy to learn how to structure it better!

Moxinilian · 2018-04-06T17:22:10Z

You're right to bring this up because I thought about it and forgot to mention it.
I initially thought about arabic characters that can merge and change shape. So I wanted to have the closure take an &str instead. However, chars do not have the same memory representation as str in Rust. Therefore, casting one to the other requires to reallocate memory for every character, which can be expensive (at least more expensive than what is achieved now, so I did not dare do it).

But as I'm writing this I'm realizing that we could simply stop iterating over characters and make it split words. Even better, this approach would feel very natural in a functional programming style, which is in my opinion better in general, and unless you think otherwise, would really be appropriate. But then, what is a word? Where does it begin and where does it end? Is ending punctuation such as colons or interrogation marks part of a word or separate? And how would we handle hyphenation? It raises a lot of questions that I would struggle to answer.

Maybe we could try to disallow hyphenation on custom lengths and use spaces as the only word delimiters. But then that would make more conditions and increase the code complexity, while also probably making it quite inelegant.
Tell me what you feel is the best approach so I can try to implement it.

Feel free to also check out my concerns about support of floating points in issue #126.

Moxinilian · 2018-04-07T13:32:24Z

I'm going to revert the changes and try to implement that instead, as there would be no point in merging this now that we're going down the word route.

mgeisler · 2018-04-07T21:18:35Z

My comments about combining characters might have been distracting and unhelpful since it's the current code that suffers from those problems -- it already works char by char and does not account for combining characters at all.

Looking at the unicode-width crate, I see that it has code like:

impl UnicodeWidthStr for str {
    fn width(&self) -> usize {
        self.chars().map(|c| cw::width(c, false).unwrap_or(0)).fold(0, Add::add)
    }
}

This means that the char by char approach is perfectly fine when we use that crate to measure the column width of the monospace characters.

However, I like the idea of making this more generally applicable, so I would welcome a refactor in the direction of scanning for split points (breakable whitespace) and then measuring the width of these words on a per-word basis. The end result should be identical to what we do today, but it should be more adaptable.

Moxinilian · 2018-04-07T21:21:43Z

Alright then. I now remember why I accepted it anyway: the reason I'm working on it is Piston, and piston-graphics seems to also be fine with computing width character by character.
In fact, I was trying to find good ways to rewrite the algorithm for str, and it really was clunky. So I believe it is acceptable for now.
However, my project will need to support emojis in a way. So I guess sooner or later I will have to go back to it...

Moxinilian · 2018-04-07T21:24:53Z

I'm going to read this document to verify easy solutions have not already been found.
http://unicode.org/reports/tr14/

mgeisler · 2018-04-08T19:40:48Z

I'm thinking I'll make a 0.10.0 release now (after #129 got reported) and then we can merge this in and play a bit with the API and do refactorings and whatnot for a later 0.11.0 release. Hope that seems okay!

robinkrahl · 2020-10-18T16:11:18Z

I’m also interested in this feature to be able to use textwrap with PDF documents (in genpdf). What is the current status of this PR? Is there anything I can do to help?

mgeisler · 2020-10-18T16:48:56Z

Hi @robinkrahl, I think this PR has stalled, but I would still like to see this kind of feature implemented. @Moxinilian, if you're still interested in this, then please let us know!

I recently rewrote some of the inner parts of textwrap (#213), so I think this PR should be closed and then a new one could be opened to fit with the new API. Ideally, this change should be invisible at the surface-level of the API, so that one can continue to write

let lines = wrap("some text", 80);

Perhaps the WrapOptions trait could gain a width(&str) -> T method, where T can be usize, float, or even u16 as needed (I mention u16 since the terminal_size crate measures the width in u16, which makes me think that it's completely overkill to use usize all over).

Let me close this PR for now and then we can perhaps discuss more in #126.

Added support for custom width providers

74d94fe

Théo Degioanni added 3 commits April 1, 2018 20:20

forgotten those

c5673cb

forgot other ones... I am getting old

7469aaf

Unused import

6351b79

Moxinilian closed this Apr 7, 2018

Moxinilian reopened this Apr 7, 2018

mgeisler mentioned this pull request May 22, 2018

Handle ANSI color codes #140

Closed

mgeisler closed this Oct 18, 2020

mgeisler mentioned this pull request Oct 18, 2020

Allow the use of the library outside of the context of the terminal #126

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added support for custom width providers #128

Added support for custom width providers #128

Moxinilian commented Apr 1, 2018 •

edited

Loading

Moxinilian commented Apr 1, 2018 •

edited

Loading

mgeisler commented Apr 6, 2018

Moxinilian commented Apr 6, 2018 •

edited

Loading

Moxinilian commented Apr 7, 2018

mgeisler commented Apr 7, 2018 •

edited

Loading

Moxinilian commented Apr 7, 2018 •

edited

Loading

Moxinilian commented Apr 7, 2018

mgeisler commented Apr 8, 2018

robinkrahl commented Oct 18, 2020

mgeisler commented Oct 18, 2020

Added support for custom width providers #128

Added support for custom width providers #128

Conversation

Moxinilian commented Apr 1, 2018 • edited Loading

Moxinilian commented Apr 1, 2018 • edited Loading

mgeisler commented Apr 6, 2018

Moxinilian commented Apr 6, 2018 • edited Loading

Moxinilian commented Apr 7, 2018

mgeisler commented Apr 7, 2018 • edited Loading

Moxinilian commented Apr 7, 2018 • edited Loading

Moxinilian commented Apr 7, 2018

mgeisler commented Apr 8, 2018

robinkrahl commented Oct 18, 2020

mgeisler commented Oct 18, 2020

Moxinilian commented Apr 1, 2018 •

edited

Loading

Moxinilian commented Apr 1, 2018 •

edited

Loading

Moxinilian commented Apr 6, 2018 •

edited

Loading

mgeisler commented Apr 7, 2018 •

edited

Loading

Moxinilian commented Apr 7, 2018 •

edited

Loading