Introduce wrapping using an optimal-fit algorithm #234

mgeisler · 2020-11-30T14:52:57Z

This PR introduces a new wrapping algorithm which finds a globally optimal set of line breaks, taking certain penalties into account. This is inspired by the line breaking algorithm used in TeX, described in the 1981 article Breaking Paragraphs into Lines[1] by Knuth and Plass. The implementation here is based on Python code by David Eppstein[2].

The wrapping algorithm which we’ve been using until now is a “greedy” or “first fit” algorithm with no look-ahead. It simply accumulates words until no more fit on the line. While simple and predictable, this algorithm can produce poor line breaks when a long word is moved to a new line, leaving behind a large gap.

The new “optimal fit” algorithm considers all possible break points and picks the breaks which minimize the gaps at the end of each line. More precisely, the algorithm assigns a penalty to a break point, determined by (target_width - line_width)**2. As an example, if you’re wrapping at 80 columns, a line with 78 characters has a penalty of 4, but a line with only 75 characters has a penalty of 25. Shorter lines are thus penalized more heavily.

The overall optimization minimizes the sum of the squares. The effect is that the algorithm will move short words down to subsequent lines if it lowers the total cost for the paragraph. This can be seen in action if we wrap the text “To be, or not to be: that is the question” in a narrow column with room for only 10 characters.

The greedy algorithm will produce these lines, each annotated with the corresponding penalty:

"To be, or"   1² =  1
"not to be:"  0² =  0
"that is"     3² =  9
"the"         7² = 49
"question"    2² =  4

We see that line four with “the” leaves a gap of 7 columns, which gives it a penalty of 49. The sum of the penalties is 63.
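The greedy lines above can be reproduced with a few lines of Rust. This is only a minimal sketch of a first-fit wrapper: the function name `first_fit` is made up for illustration and is not textwrap's API, and it assumes every character is one column wide.

```rust
/// Hypothetical helper (not part of textwrap): greedy first-fit wrapping.
/// Words are appended to the current line until the next one would not fit.
fn first_fit(text: &str, width: usize) -> Vec<String> {
    let mut lines: Vec<String> = Vec::new();
    let mut current = String::new();
    for word in text.split_whitespace() {
        // Width of the line if `word` were appended (plus a space if needed).
        let needed = if current.is_empty() {
            word.chars().count()
        } else {
            current.chars().count() + 1 + word.chars().count()
        };
        // No look-ahead: as soon as the word does not fit, start a new line.
        if needed > width && !current.is_empty() {
            lines.push(std::mem::take(&mut current));
        }
        if !current.is_empty() {
            current.push(' ');
        }
        current.push_str(word);
    }
    if !current.is_empty() {
        lines.push(current);
    }
    lines
}
```

Calling `first_fit("To be, or not to be: that is the question", 10)` yields the five lines listed above, including the lonely “the” on line four.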

With an optimal wrapping algorithm, the first line is shortened in order to ensure that line four has a smaller gap:

"To be,"     4² = 16
"or not to"  1² =  1
"be: that"   2² =  4
"is the"     4² = 16
"question"   2² =  4

This time the sum of the penalties is 41, so the algorithm will prefer these break points over the first ones.
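The two sums can be checked mechanically. The snippet below is a small sketch of the simplified cost only (squared gap per line, no overflow or hyphen penalties); `total_penalty` is a hypothetical name used for this illustration.

```rust
/// Sum of squared gaps between the target width and each line's width.
/// This is the simplified cost from the description, nothing more.
fn total_penalty(lines: &[&str], width: usize) -> usize {
    lines
        .iter()
        .map(|line| {
            // Lines are assumed to fit; saturating_sub avoids underflow otherwise.
            let gap = width.saturating_sub(line.chars().count());
            gap * gap
        })
        .sum()
}
```

With a width of 10, the greedy lines sum to 63 and the optimal-fit lines to 41, matching the two tables.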

The full algorithm is slightly more complex than this, e.g., lines longer than the line width are penalized heavily to suppress them. Additionally, hyphens are penalized to ensure they only occur when they improve the breaks substantially.

If a paragraph has n places where line breaks can occur, there are potentially 2**n different ways to typeset it. Searching through all possible combinations would be prohibitively slow. However, it turns out that the problem can be formulated as the task of finding column minima in a cost matrix. This matrix has a special form (totally monotone) which lets us use a linear-time algorithm called SMAWK[3] to find the optimal break points.

This means that the time complexity remains O(n) where n is the number of words.
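To make the column-minima formulation concrete, here is a sketch that scans every entry of each column instead of using SMAWK, so it runs in O(n²) rather than O(n). It is not the code in this PR: the function name, the use of `u64` costs, and the flat overflow penalty of 1,000,000 per character are assumptions chosen for the illustration, and the last line is penalized like any other.

```rust
/// Illustrative quadratic dynamic program for optimal-fit wrapping.
/// Returns the wrapped lines and the total cost.
fn optimal_fit_quadratic(words: &[&str], width: usize) -> (Vec<String>, u64) {
    let n = words.len();
    // prefix[k] = total character count of words[..k].
    let mut prefix = vec![0usize; n + 1];
    for (k, w) in words.iter().enumerate() {
        prefix[k + 1] = prefix[k] + w.chars().count();
    }
    // Cost of putting words[i..j] (with i < j) on a single line.
    let line_cost = |i: usize, j: usize| -> u64 {
        let line_width = prefix[j] - prefix[i] + (j - i - 1);
        if line_width > width {
            // Assumed stand-in for the heavy overflow penalty.
            1_000_000 * (line_width - width) as u64
        } else {
            let gap = (width - line_width) as u64;
            gap * gap
        }
    };
    // minima[j] = (i, cost): the cheapest way to wrap words[..j] ends with
    // the line words[i..j] and has total cost `cost`.
    let mut minima: Vec<(usize, u64)> = vec![(0, 0); n + 1];
    for j in 1..=n {
        let mut best = (0, u64::MAX);
        for i in 0..j {
            let cost = minima[i].1 + line_cost(i, j);
            if cost < best.1 {
                best = (i, cost);
            }
        }
        minima[j] = best;
    }
    // Walk the minima backwards to recover the break points.
    let mut lines = Vec::new();
    let mut pos = n;
    while pos > 0 {
        let prev = minima[pos].0;
        lines.push(words[prev..pos].join(" "));
        pos = prev;
    }
    lines.reverse();
    (lines, minima[n].1)
}
```

For the ten words of “To be, or not to be: that is the question” at width 10, this returns a total cost of 41. Another set of breaks has the same cost ("To be, or" / "not to"), and the strict `<` comparison here happens to pick the one shown in the description.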

Benchmarking shows that wrapping a very long paragraph with ~300 words or 1600 characters takes ~3.5 times as long as before. The first-fit algorithm took 19 microseconds; optimal-fit takes 72 microseconds. This seems more than fast enough, and I’ve thus made the optimal-fit algorithm the default. If desired, the first-fit algorithm can still be selected.

robinkrahl · 2020-12-02T15:01:03Z

This looks very interesting! Is the last line taken into account when calculating the penalties? And regarding overlong lines, is it possible that the optimal-fit algorithm produces overlong lines that would not be overlong when using the first-fit algorithm?

mgeisler · 2020-12-02T15:28:10Z

This looks very interesting!

Thanks! I've been playing with it for a while now, but only recently found the time to push it over the finish line :-)

Is the last line taken into account when calculating the penalties?

No, the last line does not get the gap * gap treatment. Though it does get a small penalty if it's shorter than 1/4 of the line width. This all happens in wrap_optimal_fit.

The logic here can be more or less complex: the original code simply added a penalty if the last line had a single word. However, I found that this looks odd in my small test cases where the last line might be baz from foo bar baz.

And regarding overlong lines, is it possible that the optimal-fit algorithm produces overlong lines that would not be overlong when using the first-fit algorithm?

Yes, this is possible. I got curious about it myself and created a pathological case to demonstrate it. Basically, if a line looks like this:

short and looooooooooooooooooooooooooong

then it's a question of the penalty paid for moving the long word onto the next line. Here I made the word 30 characters long. In the worst case, it is the very last g which makes the word not fit and we end up with a gap of 30 on the first line:

short and
looooooooooooooooooooooooooong

The gap has a penalty of 900. In addition, there is a 1000 penalty for every new line added, so the total solution costs 1900.

This is weighed against the alternative of letting the g overflow the first line:

short and looooooooooooooooooooooooooong

I've set the per-character penalty for overflow to 2500. So overflowing costs 2500 which is more than 1900 and thus we end up with two lines.

However, if the long word is 50 characters wide, then the cost of leaving a gap is 2500, which together with the per-line penalty of 1000 gives 3500. That is more than the 2500 for overflowing, so the balance changes and an overflow does happen.
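The trade-off can be written down as a tiny calculation. This sketch only mirrors the arithmetic in this comment (worst case: the gap equals the word length and a single character overflows); the constant and function names are invented for the illustration and are not taken from the implementation.

```rust
// Values quoted in this comment; the names are hypothetical.
const NEW_LINE_PENALTY: u32 = 1000;
const OVERFLOW_PENALTY_PER_CHAR: u32 = 2500;

/// Cost of moving the long word to its own line, in the worst case where
/// the gap left on the first line equals the word length.
fn cost_with_break(word_len: u32) -> u32 {
    word_len * word_len + NEW_LINE_PENALTY
}

/// Cost of keeping the word on the first line and overflowing by `overflow` characters.
fn cost_with_overflow(overflow: u32) -> u32 {
    overflow * OVERFLOW_PENALTY_PER_CHAR
}

/// True if letting a single character overflow is cheaper than breaking.
fn prefers_overflow(word_len: u32) -> bool {
    cost_with_overflow(1) < cost_with_break(word_len)
}
```

Under these assumptions the tipping point is a 39-character word: 38² + 1000 = 2444 is still cheaper than overflowing, while 39² + 1000 = 2521 is not.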

The numbers are pretty arbitrary, though I played around with the interactive example program to see what the effects of the parameters are.

[1]: http://www.eprg.org/G53DOC/pdfs/knuth-plass-breaking.pdf (Breaking Paragraphs into Lines, Knuth and Plass)
[2]: https://github.com/jfinkels/PADS/blob/master/pads/wrap.py (Python code by David Eppstein)
[3]: https://lib.rs/crates/smawk (SMAWK)

robinkrahl · 2020-12-02T15:50:10Z

Thanks for the explanations! I’m currently thinking about whether it is possible to use this algorithm if I don’t know all line widths in advance. Maybe I can estimate an upper bound for the fragments that fit in the current area using the first-fit algorithm, re-wrap them using the optimal-fit algorithm and then choose the better result. I’ll try to run some experiments. Please let me know if you have any other ideas.

mgeisler · 2020-12-03T10:37:59Z

I started writing this in #126, but I think it applies better here...

The line height might depend on the fragments (different font sizes,
font families, formats, etc.). So the width of line n might depend on
which words have been assigned to lines 1..n-1.

Thanks for the explanation, that makes sense...

The cost function in wrap_optimal_fit is called with i and j and has to return the cost of a line with fragments[i..j]. It is given a third argument minima of type &[(usize, i32)] which reflects the previously computed minima (it was called values before, but I think minima is a better name).

More concretely, the usize at minima[j].0 tells you the index i so that fragments[i..j] minimize the total cost. The i32 at minima[j].1 is the cost of fragments[i..j]. At least that's how I read the original Python code :-)

At every invocation of the cost function, I use these minima to compute the line number for fragment i and I then use this line number to compute the current target width:

        // Line number for fragment `i`.
        let line_number = line_numbers.get(i, &minima);  // was &values
        let target_width = std::cmp::max(1, line_widths(line_number));

The line numbers are computed and cached by line_numbers. This is where we call the user-supplied line_widths function.
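As a rough illustration of the caching idea, a line-number cache could look like the sketch below. It is an assumption-laden toy, not the code in this PR: the type name `LineNumberCache` is hypothetical, and it relies on `minima[j].0 < j` for every `j > 0` and on the entries for indices up to `i` being final when `get(i, ..)` is called.

```rust
/// Hypothetical cache: `cached[i]` is the number of lines that
/// fragments[..i] are broken into, i.e. the line fragment `i` starts on.
struct LineNumberCache {
    cached: Vec<usize>,
}

impl LineNumberCache {
    fn new() -> Self {
        // Fragment 0 always starts on line 0.
        LineNumberCache { cached: vec![0] }
    }

    /// Line number for fragment `i`, following the break points in `minima`.
    /// Each entry is computed once, so repeated lookups are O(1).
    fn get(&mut self, i: usize, minima: &[(usize, i32)]) -> usize {
        while self.cached.len() <= i {
            let pos = self.cached.len();
            // The line ending just before `pos` starts at fragment `prev`.
            let prev = minima[pos].0;
            let line_number = self.cached[prev] + 1;
            self.cached.push(line_number);
        }
        self.cached[i]
    }
}
```

Calling `get(fragments.len(), &minima)` on such a cache gives the total number of lines, which is how the capacity of the `lines` vector could be chosen.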

At this point, we don't know what the final line breaks will be for the whole paragraph — this depends on the jumps in the final minima vector:

    let mut lines = Vec::with_capacity(line_numbers.get(fragments.len(), &minima));
    let mut pos = fragments.len();
    loop {
        let prev = minima[pos].0;
        lines.push(&fragments[prev..pos]);
        pos = prev;
        if pos == 0 {
            break;
        }
    }

    lines.reverse();

However, when computing the cost for i and j, we do know the optimal way to break fragments[..i] into lines. This is given by the minima passed to the cost function and we already use them to compute the line number for fragment i.

One could in principle use the information in minima to fully typeset fragments[..i]. This would allow you to compute the precise height of those fragments and you could let this flow into the computation of line_widths(line_number).

Now, I'm not sure that this is an O(1) computation any longer. The amazing guarantee of the SMAWK machinery is that it will evaluate the cost function only O(n) times for an n-word string. I had to introduce caching of line numbers to ensure that we can compute the cost function in constant time, yielding an overall linear time algorithm. I guess similar caching could be used if you typeset fragments[..i] completely when evaluating the cost of fragments[i..j].

mgeisler · 2020-12-03T14:23:18Z

I think there is potential to make this more flexible and smarter going forward... I'll merge this for now and make a release to get the new API into the hands of people sooner rather than later.

I feel we can adjust things in textwrap::core pretty liberally going forward — I've searched GitHub a bit and it seems that 99% of applications use textwrap::fill or textwrap::wrap. Only a few applications bother with adjusting the wrapping settings and so I think we can make a new breaking release in a month or two without too much concern.

robinkrahl · 2020-12-06T17:06:10Z

I think there is potential to make this more flexible and smarter going forward... I'll merge this for now and make a release to get the new API into the hands of people sooner rather than later.

Sounds good to me!

The line height might depend on the fragments (different font sizes,
font families, formats, etc.). So the width of line n might depend on
which words have been assigned to lines 1..n-1.

Thanks for the explanation, that makes sense...

Just to be clear, that was just a general thought that does not apply to my use case. (While the line height might change, the line width is constant for the current text area.) So while I think that such a feature might be useful for others, I personally don’t need it.

mgeisler force-pushed the optimal-fit-algorithm branch 9 times, most recently from 66796e0 to a568413 on December 2, 2020 13:05

mgeisler changed the title ~~Introduce wrapping using globally optimal breakpoints~~ Introduce wrapping using an optimal-fit algorithm Dec 2, 2020

mgeisler force-pushed the optimal-fit-algorithm branch from 2e425e6 to fb54a01 on December 2, 2020 13:23

mgeisler mentioned this pull request Dec 2, 2020

Allow the use of the library outside of the context of the terminal #126

Closed

mgeisler force-pushed the optimal-fit-algorithm branch from fb54a01 to bc48530 on December 2, 2020 15:32

Document splitting pipeline

bde9dee

mgeisler force-pushed the optimal-fit-algorithm branch from bc48530 to bde9dee on December 3, 2020 10:45

mgeisler merged commit 695a560 into master Dec 3, 2020

mgeisler mentioned this pull request Dec 3, 2020

Binary size could be smaller clap-rs/clap#1365

Open

mgeisler deleted the optimal-fit-algorithm branch January 30, 2021 16:48
