Linebreak handling control #534

Closed

favonia commented Jun 9, 2012

There are languages (Chinese, Japanese, etc.) that do not use spaces to separate words. This is my hack to add a new option, --linebreak, that changes the default newline conversion for markdown input. It can currently be space (the default, original behavior), skip, or preserve. I'd like to hear your opinions on this and work on the patch further, for example by making the option work for other kinds of input.

Ideally we should be able to mark a particular region as being in a particular language, and the conversion should be context-aware. I have only implemented the basic control here, and it applies only to markdown input. For related information, see text-space-collapse in the current CSS3 draft and some questions on Stack Overflow.
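The three modes described above could be sketched roughly as follows. This is only an illustration of the intended behavior, not the code in the patch: `LineBreakMode`, its constructors, and `joinLines` are hypothetical names, and the input is assumed to be the already-split lines of a single hand-wrapped paragraph.

```haskell
import Data.List (intercalate)

-- Hypothetical names for the three values of the proposed --linebreak option.
data LineBreakMode = Space | Skip | Preserve
  deriving (Eq, Show)

-- | Join the lines of one hand-wrapped paragraph according to the mode.
joinLines :: LineBreakMode -> [String] -> String
joinLines Space    = intercalate " "   -- newline becomes a space (original behavior)
joinLines Skip     = concat            -- newline is dropped entirely
joinLines Preserve = intercalate "\n"  -- newline is kept as a line break
```

With `Skip`, a Chinese paragraph wrapped over several lines is joined without any stray spaces, while `Space` keeps the behavior English text expects.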

Owner
jgm commented Jul 23, 2012

Given that a text may contain parts in Chinese/Japanese and parts in English, I wonder whether a switch that affects the whole document like this makes the most sense. (Though certainly others have asked for a switch that makes linebreaks into hard breaks.)

Another possibility would be this: if the next nonspace character after the newline is a Chinese/Japanese character, then the newline is automatically skipped? What do you think? (Ideally we'd also check the character before, but this would require changes to the parser that I think would affect performance.)
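The look-ahead rule suggested here might be sketched as below. It is a minimal illustration rather than pandoc's parser: `isCJK` and `convertNewlines` are hypothetical names, the code point ranges are an assumption covering only Hiragana, Katakana, and the CJK Unified Ideographs block, and the input is assumed to be the text of a single paragraph.

```haskell
import Data.Char (isSpace)

-- | Rough CJK test. The ranges are an assumption and far from complete:
-- Hiragana, Katakana, and CJK Unified Ideographs only.
isCJK :: Char -> Bool
isCJK c = any inRange ranges
  where
    inRange (lo, hi) = c >= lo && c <= hi
    ranges =
      [ ('\x3040', '\x309F')
      , ('\x30A0', '\x30FF')
      , ('\x4E00', '\x9FFF')
      ]

-- | Turn each newline into a space, unless the next non-space character
-- is CJK, in which case the newline (and any indentation after it) is dropped.
-- Only the character after the newline is inspected, never the one before.
convertNewlines :: String -> String
convertNewlines [] = []
convertNewlines ('\n' : rest) =
  let rest' = dropWhile isSpace rest
  in case rest' of
       (c : _) | isCJK c -> convertNewlines rest'
       _                 -> ' ' : convertNewlines rest
convertNewlines (c : rest) = c : convertNewlines rest
```

Because the rule looks only forward, a line ending in Chinese followed by a line starting with an English word still receives a space, while an English line followed by a Chinese line does not.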

favonia commented Jul 23, 2012

I share your concern about multilingual documents. I already inserted lots of spaces in my posts mixing English and Chinese. However, I'm not totally convinced by your solution. It's quite common in my posts that a Chinese character is followed by an English word. Also the parser would become a little bit difficult to explain.

I have several random ideas. First, I am thinking of a toggle like gcc's #pragma pack, with which you can push and pop linebreak configurations. Second, it seems we are building an inverse of the Unicode line breaking algorithm. The goal could be to interpret X as Y iff Y is the string with the fewest line breaks that the Unicode algorithm would format as X. What do you think? I am still studying the Unicode standard, though.

Owner
jgm commented Jul 23, 2012

+++ favonia [Jul 22 12 20:03 ]:

> I share your concern about multilingual documents. I already inserted lots of spaces in my posts mixing English and Chinese. However, I'm not totally convinced by your solution. It's quite common in my posts that a Chinese character is followed by an English word. Also the behavior of the parser would become a little bit difficult to explain.
>
> I have several random ideas. First, I am thinking of a toggle like gcc's #pragma pack, with which you can push and pop linebreak configurations. Second, it seems we are building an inverse of the Unicode line breaking algorithm. The goal could be to interpret X as Y iff Y is the string with the fewest line breaks that the Unicode algorithm would format as X. What do you think? I am still studying the Unicode standard, though.

That sounds too complicated and would likely be slow.

Maybe the thing to do is just to add a command-line flag that makes all
line breaks into hard breaks.

It seems to me that this would be good enough for writers who write
Chinese; they can either choose to use line breaks in formatting their
text or not, but either way no extra spaces will be inserted. I don't
see much point in offering a version that just ignores the line breaks.

It would also be useful for some other purposes.

Finally, if your text is mixed, with both Chinese and English, say, then
you can just avoid using hard line breaks in the English text.

Thoughts?

favonia commented Jul 23, 2012

The reason to support hand-wrapped paragraphs is that they look better. That's actually one of the principles of Markdown. What I am doing here is extending this principle to other writing systems, as default Markdown is biased toward Western writing systems. I think your proposal would make sense if Markdown had not supported hand-wrapped paragraphs from the very beginning and had kept all line breaks.

BTW I think the stack-like option is not difficult to implement; this line

stateLineBreakConv   :: LineBreakConv

can be changed to

stateLineBreakConv   :: [LineBreakConv]

and we only look at the head. I guess this won't hurt performance too much (compared to my current hack). The upside of this approach is that we can change the line break setting on the fly and paragraphs remain composable. The downside is that we need to extend the syntax carefully, because an explicit option might break another principle of Markdown (being publishable as-is). For example, the following option works but clearly indicates that the document is marked up.

%% linebreak push raw

Words
with
newlines

%% linebreak pop

A simpler alternative is to give up composability and have

%% linebreak raw

Words
with
newlines

%% linebreak space

instead. In this case stateLineBreakConv :: LineBreakConv suffices.
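The stack-like variant could be sketched as below, with the parser only ever consulting the head of the list. Apart from `stateLineBreakConv` and the `LineBreakConv` type name, everything here is hypothetical: the constructors `ConvSpace` and `ConvRaw`, the cut-down `ParserState` record, and the helper functions are assumptions for illustration, not pandoc's actual state type.

```haskell
-- Hypothetical constructors; the thread only names the type.
data LineBreakConv = ConvSpace | ConvRaw
  deriving (Eq, Show)

-- A stand-in for the real parser state, reduced to the one field discussed.
newtype ParserState = ParserState
  { stateLineBreakConv :: [LineBreakConv]
  } deriving (Eq, Show)

-- | Start with the default conversion at the bottom of the stack.
defaultState :: ParserState
defaultState = ParserState [ConvSpace]

-- | The active conversion is the head of the stack.
currentConv :: ParserState -> LineBreakConv
currentConv st = case stateLineBreakConv st of
  (c : _) -> c
  []      -> ConvSpace  -- defensive fallback to the default

-- | What a "%% linebreak push ..." directive would do.
pushConv :: LineBreakConv -> ParserState -> ParserState
pushConv c st = st { stateLineBreakConv = c : stateLineBreakConv st }

-- | What "%% linebreak pop" would do; the bottom entry is never removed.
popConv :: ParserState -> ParserState
popConv st = case stateLineBreakConv st of
  (_ : rest@(_ : _)) -> st { stateLineBreakConv = rest }
  _                  -> st
```

Since a pop restores whatever was active before the matching push, a fragment that pushes and pops its own setting can be pasted into another document without disturbing the surrounding text. The mutable variant loses this property, because it overwrites the single value.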

As for the Unicode algorithm, I guess only the documentation is overly complicated. Also, I think an approximation is good enough. I like your idea of skipping the newline automatically, but I might want something smarter.

In sum, I think these are good approaches:

  1. stack-like linebreak options (costs: extra markup)
  2. mutable linebreak options (costs: extra markup and composability, but simpler than 1)
  3. smart linebreak conversion (costs: speed, needs another option to support hard breaks)
  4. 1+3 or 2+3? that is, a stack-like or mutable option to turn off smart linebreak conversion

Personally I would like to test the smart converter. I understand your concern about speed, but I would like to try it out. How should I test the parsing speed? Is there a standard benchmark I can use to measure the impact?

Owner
jgm commented Jul 23, 2012

+++ favonia [Jul 23 12 06:41 ]:

> In sum, I think these are good approaches:
>
>   1. stack-like linebreak options (costs: extra markup)
>   2. mutable linebreak options (costs: extra markup and composability, but simpler than 1)
>   3. smart linebreak conversion (costs: speed, needs another option to support hard breaks)
>   4. 1+3 or 2+3? that is, a stack-like or mutable option to turn off smart linebreak conversion

I don't like the ideas that involve non-textual markup codes. The whole
idea of markdown is that it should be readable and not cluttered up with
such things.

But 3 may be worth exploring further.

> Personally I would like to test the last options. I understand your concern about speed but I would like to try it out. How should I test the parsing speed? What's the standard benchmark that I can measure the impact on speed?

There's a Benchmark.hs in the main directory that you can use.
I should really hook this into the cabal file.

favonia commented Jul 23, 2012

Cool. I think we at least agree that an option to support hard breaks will be useful. As for the interface, how about changing the linebreak state to

data LineBreakConv
    = LineBreakSpace -- ^ Convert newline characters to spaces
    | LineBreakSmart -- ^ Use some heuristics to handle newline characters in multilingual text
    | LineBreakPreserve -- ^ Preserve all newline characters

where LineBreakSmart refers to the algorithm we were discussing? Hope I can figure out a nice algorithm soon.

BTW, I'm aware that my English is terrible so I'm all ears on better descriptions and better names for these options and data types.
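To make the proposed data type concrete, here is one way a reader might dispatch on it. The `LineBreakSmart` rule shown is only one candidate heuristic, assumed for illustration since the thread has not settled on an algorithm: it drops the newline when the characters on both sides look East Asian. `isWide`, `separator`, and `joinWith` are hypothetical names, and the code point ranges are an incomplete assumption.

```haskell
data LineBreakConv
    = LineBreakSpace    -- ^ Convert newline characters to spaces
    | LineBreakSmart    -- ^ Use some heuristics to handle newline characters in multilingual text
    | LineBreakPreserve -- ^ Preserve all newline characters
    deriving (Eq, Show)

-- | Rough East Asian test (assumed ranges: Hiragana, Katakana, CJK ideographs).
isWide :: Char -> Bool
isWide c = any (\(lo, hi) -> c >= lo && c <= hi)
  [('\x3040', '\x309F'), ('\x30A0', '\x30FF'), ('\x4E00', '\x9FFF')]

-- | Text to emit between two adjacent source lines.
separator :: LineBreakConv -> String -> String -> String
separator LineBreakSpace    _ _ = " "
separator LineBreakPreserve _ _ = "\n"
separator LineBreakSmart prev next
  | endsWide && startsWide = ""   -- East Asian on both sides: no space
  | otherwise              = " "
  where
    endsWide   = case reverse prev of
                   (c : _) -> isWide c
                   []      -> False
    startsWide = case next of
                   (c : _) -> isWide c
                   []      -> False

-- | Join the lines of one paragraph using the chosen conversion.
joinWith :: LineBreakConv -> [String] -> String
joinWith _ [] = ""
joinWith conv (l : ls) =
  l ++ concat (zipWith (\p n -> separator conv p n ++ n) (l : ls) ls)
```

Checking the character before the newline as well as the one after keeps a space between a Chinese line and a following English word, at the cost of the extra look-behind mentioned earlier in the thread.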

Owner
jgm commented Jul 23, 2012

That sounds good, at least for now. We can always change the data type
names later, once the algorithm has been proven to work.

> Finally, if your text is mixed, with both Chinese and English, say, then
> you can just avoid using hard line breaks in the English text.

or avoid using soft line breaks in Chinese text. This is the solution I'm going with atm.

Owner
jgm commented Feb 22, 2017

I believe this can be closed, now that we have the ignore_line_breaks and east_asian_line_breaks extensions.

jgm closed this Feb 22, 2017