Linebreak handling control #534

favonia · 2012-06-09T23:49:51Z

There are languages (Chinese, Japanese, etc.) that do not use spaces to separate words. This is my hack to add a new option, --linebreak, that changes the default newline conversion for markdown input. It can currently be space (the default, original behavior), skip, or preserve. I'd like to hear your opinions on this and to work on this patch more, for example by making the option work for other kinds of input.

Ideally we should be able to mark a particular region with a particular language, and the conversion should be context-aware. I only implemented the basic control here, and it is valid only for markdown input. For related information, see text-space-collapse in the current CSS3 draft and some questions on Stack Overflow.

Also update README to document new features.

jgm · 2012-07-23T01:38:53Z

Given that a text may contain parts in Chinese/Japanese and parts in English, I wonder whether a switch that affects the whole document like this makes the most sense. (Though certainly others have asked for a switch that makes linebreaks into hard breaks.)

Another possibility would be this: if the next nonspace character after the newline is a Chinese/Japanese character, then the newline is automatically skipped? What do you think? (Ideally we'd also check the character before, but this would require changes to the parser that I think would affect performance.)

favonia · 2012-07-23T03:03:06Z

I share your concern about multilingual documents. I already insert lots of spaces in my posts mixing English and Chinese. However, I'm not totally convinced by your solution. It's quite common in my posts that a Chinese character is followed by an English word. Also, the behavior of the parser would become a little bit difficult to explain.

I have several random ideas. First, I am thinking of a toggle like gcc's #pragma pack, with which you can push and pop linebreak configurations. Second, it seems we are building an inverse of the Unicode line breaking algorithm. The goal could be to interpret X as Y iff Y is the string with the fewest line breaks that the Unicode algorithm would format as X. What do you think? I am still studying the Unicode standard, though.

jgm · 2012-07-23T05:15:46Z

+++ favonia [Jul 22 12 20:03 ]:

> I share your concern about multilingual documents. I already insert lots of spaces in my posts mixing English and Chinese. However, I'm not totally convinced by your solution. It's quite common in my posts that a Chinese character is followed by an English word. Also, the behavior of the parser would become a little bit difficult to explain.
>
> I have several random ideas. First, I am thinking of a toggle like gcc's #pragma pack, with which you can push and pop linebreak configurations. Second, it seems we are building an inverse of the Unicode line breaking algorithm. The goal could be to interpret X as Y iff Y is the string with the fewest line breaks that the Unicode algorithm would format as X. What do you think? I am still studying the Unicode standard, though.

That sounds too complicated and would likely be slow.

Maybe the thing to do is just to add a command-line flag that makes all
line breaks into hard breaks.

It seems to me that this would be good enough for writers who write
Chinese; they can either choose to use line breaks in formatting their
text or not, but either way no extra spaces will be inserted. I don't
see much point in offering a version that just ignores the line breaks.

It would also be useful for some other purposes.

Finally, if your text is mixed, with both Chinese and English, say, then
you can just avoid using hard line breaks in the English text.

Thoughts?

favonia · 2012-07-23T13:41:26Z

The reason to support hand-wrapped text paragraphs is that they look better. That's actually one of the principles of Markdown. What I am doing here is extending this principle to other writing systems, as default Markdown is biased toward Western writing systems. I think your proposal would make sense if Markdown had kept all line breaks from the very beginning instead of supporting hand-wrapped paragraphs.

BTW, I think the stack-like option is not difficult to implement; this line

stateLineBreakConv   :: LineBreakConv

can be changed to

stateLineBreakConv   :: [LineBreakConv]

and we only look at the head. I guess this won't hurt performance too much (compared to my current hack). The upside of this approach is that we can change the line break setting on the fly and paragraphs remain composable. The downside is that we need to extend the syntax carefully, because an explicit option might break another principle of Markdown (being publishable as is). For example, the following option works but clearly indicates that the document is marked up.

%% linebreak push raw

Words
with
newlines

%% linebreak pop

A simpler alternative to this is to give up composability and have

%% linebreak raw

Words
with
newlines

%% linebreak space

instead. In this case stateLineBreakConv :: LineBreakConv suffices.
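To make the stack-like variant concrete, here is a minimal sketch of the state handling. Everything in it is hypothetical: ParserState, pushConv, popConv, parseMode and applyDirective are illustration-only names rather than pandoc's API, and mapping the directive word "raw" to the preserve mode is an assumption.

```haskell
-- Hypothetical names throughout; none of this is pandoc's actual API.
data LineBreakConv = LineBreakSpace | LineBreakSkip | LineBreakPreserve
  deriving (Eq, Show)

-- | Parser state holding a stack of conversions; the head is the active one.
newtype ParserState = ParserState { stateLineBreakConv :: [LineBreakConv] }
  deriving (Eq, Show)

-- | The conversion currently in effect.
currentConv :: ParserState -> LineBreakConv
currentConv (ParserState (c : _)) = c
currentConv (ParserState [])      = LineBreakSpace  -- fall back to the default

pushConv :: LineBreakConv -> ParserState -> ParserState
pushConv c (ParserState cs) = ParserState (c : cs)

popConv :: ParserState -> ParserState
popConv (ParserState (_ : cs)) = ParserState cs
popConv st                     = st  -- popping an empty stack is a no-op

-- | Map a directive word to a mode. Treating "raw" as preserve is an assumption.
parseMode :: String -> Maybe LineBreakConv
parseMode "space"    = Just LineBreakSpace
parseMode "skip"     = Just LineBreakSkip
parseMode "preserve" = Just LineBreakPreserve
parseMode "raw"      = Just LineBreakPreserve
parseMode _          = Nothing

-- | Apply one "%% linebreak ..." line; any other line leaves the state alone.
applyDirective :: String -> ParserState -> ParserState
applyDirective line st =
  case words line of
    ["%%", "linebreak", "push", m] | Just c <- parseMode m -> pushConv c st
    ["%%", "linebreak", "pop"]                             -> popConv st
    _                                                      -> st
```

Starting from ParserState [LineBreakSpace], applying "%% linebreak push raw" and then "%% linebreak pop" restores the original state, which is what keeps paragraphs composable. The mutable variant would replace the list with a single value and overwrite it on each directive.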

As for the Unicode algorithm, I guess only the document describing it is overly complicated, and I think an approximation is good enough. I like your idea of dropping the space automatically, but I might want something smarter.

In sum, I think these are good approaches:

1. stack-like linebreak options (costs: extra markup)
2. mutable linebreak options (costs: extra markup and composability, but simpler than 1)
3. smart linebreak conversion (costs: speed, needs another option to support hard breaks)
4. 1+3 or 2+3? That is, a stack-like or mutable option to turn off smart linebreak conversion

Personally, I would like to test the smart converter. I understand your concern about speed, but I would like to try it out. How should I test the parsing speed? Is there a standard benchmark I can use to measure the impact on speed?

jgm · 2012-07-23T16:46:59Z

+++ favonia [Jul 23 12 06:41 ]:

> In sum, I think these are good approaches:
>
> 1. stack-like linebreak options (costs: extra markup)
> 2. mutable linebreak options (costs: extra markup and composability, but simpler than 1)
> 3. smart linebreak conversion (costs: speed, needs another option to support hard breaks)
> 4. 1+3 or 2+3? That is, a stack-like or mutable option to turn off smart linebreak conversion

I don't like the ideas that involve non-textual markup codes. The whole
idea of markdown is that it should be readable and not cluttered up with
such things.

But 3 may be worth exploring further.

> Personally, I would like to test the last option. I understand your concern about speed, but I would like to try it out. How should I test the parsing speed? Is there a standard benchmark I can use to measure the impact on speed?

There's a Benchmark.hs in the main directory that you can use.
I should really hook this into the cabal file.

favonia · 2012-07-23T17:12:00Z

Cool. I think we at least agree that an option to support hard breaks will be useful. As for the interface, how about changing the linebreak state to

data LineBreakConv
    = LineBreakSpace -- ^ Convert newline characters to spaces
    | LineBreakSmart -- ^ Use some heuristics to handle newline characters in multilingual text
    | LineBreakPreserve -- ^ Preserve all newline characters

where LineBreakSmart refers to the algorithm we were discussing? Hope I can figure out a nice algorithm soon.
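As a rough picture of how the three constructors might drive the conversion, here is a sketch on plain strings. The smart rule used here (drop the newline only when the characters on both sides are CJK) is a stand-in assumption, since the actual algorithm was still undecided at this point; isWide and convertNewlines are hypothetical names, and the ranges in isWide are deliberately incomplete.

```haskell
data LineBreakConv
  = LineBreakSpace     -- ^ Convert newline characters to spaces
  | LineBreakSmart     -- ^ Use some heuristics in multilingual text
  | LineBreakPreserve  -- ^ Preserve all newline characters
  deriving (Eq, Show)

-- | Very rough CJK test (kana and the main ideograph block only);
-- an assumed approximation for illustration.
isWide :: Char -> Bool
isWide c = (c >= '\x3040' && c <= '\x30FF') || (c >= '\x4E00' && c <= '\x9FFF')

-- | Rewrite every newline in the input according to the chosen mode.
convertNewlines :: LineBreakConv -> String -> String
convertNewlines conv = go Nothing
  where
    -- The first argument tracks the previous character, if any.
    go _ [] = []
    go prev ('\n' : rest) =
      replacement prev (nextChar rest) ++ go (Just '\n') rest
    go _ (c : rest) = c : go (Just c) rest

    nextChar (c : _) = Just c
    nextChar []      = Nothing

    replacement prev next = case conv of
      LineBreakSpace    -> " "
      LineBreakPreserve -> "\n"
      LineBreakSmart
        | both isWide prev next -> ""   -- stand-in heuristic (assumption)
        | otherwise             -> " "

    both p (Just a) (Just b) = p a && p b
    both _ _ _               = False
```

With this stand-in rule, convertNewlines LineBreakSmart "中文\n中文 and\nEnglish" gives "中文中文 and English". In a real reader, the preserve case would emit a hard line break element rather than a literal newline character.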

BTW, I'm aware that my English is terrible so I'm all ears on better descriptions and better names for these options and data types.

jgm · 2012-07-23T17:29:03Z

That sounds good, at least for now. We can always change the data type
names later, once the algorithm has been proven to work.


geyang · 2017-02-22T04:59:19Z

> Finally, if your text is mixed, with both Chinese and English, say, then
> you can just avoid using hard line breaks in the English text.

or avoid using soft line breaks in Chinese text. This is the solution I'm going with atm.

jgm · 2017-02-22T13:49:51Z

I believe this can be closed, now that we have the ignore_line_breaks and east_asian_line_breaks extensions.

jiucenglou · 2020-04-13T19:01:48Z

> I believe this can be closed, now that we have the ignore_line_breaks and east_asian_line_breaks extensions.

I have markdown files that actually use spaces to separate groups of Chinese words. Presumably, the spaces were added to make the content easier to understand. Now I would like to convert these markdown files to Word, with the spaces between Chinese characters removed during the conversion. Could you comment on whether pandoc can handle this case via something such as east_asian_word_breaks?

favonia added 2 commits July 18, 2012 09:46

Add linebreak handling to Markdown reader.

7412955

Add command-line option.

b84882a

tarleb mentioned this pull request Aug 19, 2016

md/html/txt => pdf - conversion issues for japanese language #3080

Closed

IllyaMoskvin mentioned this pull request Nov 26, 2016

Add option to turn <br> tags into {newline} instead of {space}{space}{newline} thephpleague/html-to-markdown#112

Closed

geyang mentioned this pull request Feb 22, 2017

Softbreak rendering in CJK languanges markdown-it/markdown-it#334

Closed

jgm closed this Feb 22, 2017

zhangkaizhao mentioned this pull request Nov 9, 2018

Unwanted space characters in Japanese language asciidoctor/asciidoctor#1420

Closed
