Linebreak handling control #534

Closed
wants to merge 2 commits into from

favonia
Contributor

@favonia favonia commented Jun 9, 2012

There are languages (Chinese, Japanese, etc.) that do not use spaces to separate words. This is my hack to add a new option, --linebreak, that changes the default newline conversion for markdown input. It can now be space (the default, original behavior), skip, or preserve. I'd like to hear your opinions on this and work on the patch more, for example by making the option work for other kinds of input.

Ideally we should be able to mark a particular region with a particular language, and the conversion should be context-aware. I only implemented the basic control here, and it's valid only for markdown input. For related information, see text-space-collapse in the current CSS3 draft and some questions on Stack Overflow.
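
To make the three proposed values concrete, here is a minimal sketch of what each one does to a newline inside a paragraph. It is not the code from the patch; the names LineBreakMode and convertNewlines are made up for illustration, and the input is assumed to be the text of a single paragraph.

```haskell
-- | Hypothetical mirror of the three values proposed for --linebreak.
data LineBreakMode = AsSpace | Skip | Preserve
  deriving (Eq, Show)

-- | Rewrite every newline in a paragraph according to the chosen mode.
convertNewlines :: LineBreakMode -> String -> String
convertNewlines mode = concatMap step
  where
    step '\n' = case mode of
      AsSpace  -> " "   -- original behavior: newline becomes a space
      Skip     -> ""    -- newline disappears, lines are glued together
      Preserve -> "\n"  -- newline is kept as it is
    step c = [c]
```

With Skip, a hand-wrapped Chinese paragraph is joined without stray spaces, while the same setting would glue English words together, which is the multilingual problem discussed below.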

@jgm
Owner

jgm commented Jul 23, 2012

Given that a text may contain parts in Chinese/Japanese and parts in English, I wonder whether a switch that affects the whole document like this makes the most sense. (Though certainly others have asked for a switch that makes linebreaks into hard breaks.)

Another possibility would be this: if the next nonspace character after the newline is a Chinese/Japanese character, then the newline is automatically skipped? What do you think? (Ideally we'd also check the character before, but this would require changes to the parser that I think would affect performance.)
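
A rough sketch of that rule, for illustration only: it works on a plain String rather than inside the real parser, the input is assumed to be a single paragraph, and isCJK is a hypothetical helper that approximates "Chinese/Japanese character" with a few Unicode blocks.

```haskell
import Data.Char (isSpace, ord)

-- | Rough CJK test over a few well-known Unicode blocks (an approximation).
isCJK :: Char -> Bool
isCJK c = any inRange ranges
  where
    n = ord c
    inRange (lo, hi) = n >= lo && n <= hi
    ranges =
      [ (0x3040, 0x30FF)  -- Hiragana and Katakana
      , (0x3400, 0x4DBF)  -- CJK Unified Ideographs Extension A
      , (0x4E00, 0x9FFF)  -- CJK Unified Ideographs
      , (0xF900, 0xFAFF)  -- CJK Compatibility Ideographs
      , (0xFF00, 0xFFEF)  -- Halfwidth and Fullwidth Forms
      ]

-- | Drop a newline when the next non-space character is CJK; otherwise
-- turn the newline into a space. Only the character after the newline
-- is inspected, never the one before it.
joinByNextChar :: String -> String
joinByNextChar [] = []
joinByNextChar ('\n' : rest) =
  case dropWhile isSpace rest of
    next@(c : _) | isCJK c -> joinByNextChar next
    _                      -> ' ' : joinByNextChar rest
joinByNextChar (c : rest) = c : joinByNextChar rest
```

Because only the following character is checked, the rule is asymmetric: an English word followed by a newline and a CJK character is joined without a space, while a CJK character followed by a newline and an English word gets one.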

@favonia
Contributor Author

favonia commented Jul 23, 2012

I share your concern about multilingual documents. I already inserted lots of spaces in my posts mixing English and Chinese. However, I'm not totally convinced by your solution. It's quite common in my posts that a Chinese character is followed by an English word. Also the parser would become a little bit difficult to explain.

I have several random ideas: First, I am thinking of a toggle like gcc's #pragma pack, where you can push and pop linebreak configurations. Second, it seems we are building an inverse of the Unicode line breaking algorithm. The goal can be interpreting X as Y iff Y is the string with the fewest line breaks that the Unicode algorithm would format as X. What do you think? I am still studying the Unicode standard, though.

@jgm
Owner

jgm commented Jul 23, 2012

+++ favonia [Jul 22 12 20:03 ]:

I share your concern about multilingual documents. I already inserted lots of spaces in my posts mixing English and Chinese. However, I'm not totally convinced by your solution. It's quite common in my posts that a Chinese character is followed by an English word. Also the behavior of parser would become a little bit difficult to explain.

I have several random ideas: First, I am thinking of a toggle like gcc's #pragma pack, where you can push and pop linebreak configurations. Second, it seems we are building an inverse of the Unicode line breaking algorithm. The goal can be interpreting X as Y iff Y is the string with the fewest line breaks that the Unicode algorithm would format as X. What do you think? I am still studying the Unicode standard, though.

That sounds too complicated and would likely be slow.

Maybe the thing to do is just to add a command-line flag that makes all
line breaks into hard breaks.

It seems to me that this would be good enough for writers who write
Chinese; they can either choose to use line breaks in formatting their
text or not, but either way no extra spaces will be inserted. I don't
see much point in offering a version that just ignores the line breaks.

It would also be useful for some other purposes.

Finally, if your text is mixed, with both Chinese and English, say, then
you can just avoid using hard line breaks in the English text.

Thoughts?

@favonia
Contributor Author

favonia commented Jul 23, 2012

The reason to support hand-wrapped text paragraphs is that they look better. That's actually one of the principles of Markdown. What I am doing here is extending this principle to other writing systems, as default Markdown is biased toward Western writing systems. I think your proposal would make sense if Markdown had kept all line breaks from the very beginning instead of supporting hand-wrapped paragraphs.

BTW, I think the stack-like option is not difficult to implement; this line

stateLineBreakConv   :: LineBreakConv

can be changed to

stateLineBreakConv   :: [LineBreakConv]

and we only look at the head. I guess this won't hurt performance too much (compared to my current hack). The upside of this approach is that we can change the line break setting on the fly and paragraphs remain composable. The downside is that we need to extend the syntax carefully, because an explicit option might break another principle of Markdown (being publishable as it is). For example, the following option works but clearly indicates that the document is marked up.

%% linebreak push raw

Words
with
newlines

%% linebreak pop

A simpler alternative to this is to give up composability and have

%% linebreak raw

Words
with
newlines

%% linebreak space

instead. In this case stateLineBreakConv :: LineBreakConv suffices.
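
As a sketch of the stack variant (not code from the patch), the following walks the lines of a document, treats the head of the stack as the active setting, and pairs every ordinary line with it. The constructors ConvSpace and ConvRaw, the function annotate, and the exact directive spelling are assumptions taken from the examples above.

```haskell
-- | Hypothetical setting names; "raw" in the directives maps to ConvRaw.
data LineBreakConv = ConvSpace | ConvRaw
  deriving (Eq, Show)

-- | Walk the lines of a document while keeping a stack of settings.
-- Directive lines are consumed; every other line is paired with the
-- setting at the head of the stack.
annotate :: [String] -> [(LineBreakConv, String)]
annotate = go [ConvSpace]
  where
    go _ [] = []
    go stack (l : ls) = case words l of
      ["%%", "linebreak", "push", name]
        | Just conv <- parseConv name -> go (conv : stack) ls
      ["%%", "linebreak", "pop"] -> go (pop stack) ls
      _ -> (active stack, l) : go stack ls

    -- Never pop the default setting at the bottom of the stack.
    pop (_ : rest@(_ : _)) = rest
    pop stack              = stack

    active (conv : _) = conv
    active []         = ConvSpace

    parseConv "raw"   = Just ConvRaw
    parseConv "space" = Just ConvSpace
    parseConv _       = Nothing
```

The mutable variant would replace the list with a single value that each directive overwrites, which is simpler but loses the ability to restore whatever setting an enclosing region had.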

As for the Unicode algorithm, I guess it is only the documentation that is overly complicated. Also, I think an approximation is good enough. I like your idea of ignoring the newline automatically, but I might want something smarter.

In sum, I think these are good approaches:

  1. stack-like linebreak options (costs: extra markups)
  2. mutable linebreak options (costs: extra markups and composability, but simpler than 1)
  3. smart linebreak conversion (costs: speed, needs another option to support hard breaks)
  4. 1+3 or 2+3? that is, a stack-like or mutable option to turn off smart linebreak conversion

Personally I would like to test the smart converter. I understand your concern about speed, but I would like to try it out. How should I test the parsing speed? Is there a standard benchmark I can use to measure the impact on speed?

@jgm
Owner

jgm commented Jul 23, 2012

+++ favonia [Jul 23 12 06:41 ]:

In sum, I think these are good approaches:

  1. stack-like linebreak options (costs: extra markups)
  2. mutable linebreak options (costs: extra markups and composability, but simpler than 1)
  3. smart linebreak conversion (costs: speed, needs another option to support hard breaks)
  4. 1+3 or 2+3? that is, a stack-like or mutable option to turn off smart linebreak conversion

I don't like the ideas that involve non-textual markup codes. The whole
idea of markdown is that it should be readable and not cluttered up with
such things.

But 3 may be worth exploring further.

Personally I would like to test the last options. I understand your concern about speed but I would like to try it out. How should I test the parsing speed? What's the standard benchmark that I can measure the impact on speed?

There's a Benchmark.hs in the main directory that you can use.
I should really hook this into the cabal file.

@favonia
Contributor Author

favonia commented Jul 23, 2012

Cool. I think we at least agree that an option to support hard breaks will be useful. As for the interface, how about changing the linebreak state to

data LineBreakConv
    = LineBreakSpace -- ^ Convert newline characters to spaces
    | LineBreakSmart -- ^ Use some heuristics to handle newline characters in multilingual text
    | LineBreakPreserve -- ^ Preserve all newline characters

where LineBreakSmart refers to the algorithm we were discussing? Hope I can figure out a nice algorithm soon.
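
To show how the three constructors could be dispatched, here is one possible sketch. The heuristic chosen for LineBreakSmart (drop the newline only when the characters on both sides look East Asian) is just an assumption for illustration, since the algorithm had not been settled at this point; isWide and convert are hypothetical names, and the input is assumed to be a single paragraph.

```haskell
import Data.Char (ord)

data LineBreakConv
  = LineBreakSpace     -- convert newline characters to spaces
  | LineBreakSmart     -- use a heuristic for multilingual text
  | LineBreakPreserve  -- preserve all newline characters
  deriving (Eq, Show)

-- | Very rough stand-in for "East Asian character" (an assumption):
-- kana and the main CJK Unified Ideographs block only.
isWide :: Char -> Bool
isWide c = (n >= 0x3040 && n <= 0x30FF) || (n >= 0x4E00 && n <= 0x9FFF)
  where
    n = ord c

-- | Apply the chosen conversion to the newlines of a single paragraph,
-- remembering the previous character so both neighbors can be inspected.
convert :: LineBreakConv -> String -> String
convert conv = go Nothing
  where
    go _ [] = []
    go prev ('\n' : rest) = newline prev rest ++ go (Just '\n') rest
    go _ (c : rest) = c : go (Just c) rest

    newline prev rest = case conv of
      LineBreakSpace    -> " "
      LineBreakPreserve -> "\n"
      LineBreakSmart
        | Just p <- prev, (n : _) <- rest, isWide p, isWide n -> ""
        | otherwise -> " "
```

Checking both neighbors keeps a space wherever a CJK character meets an English word across a line break, at the cost of carrying the previous character through the traversal.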

BTW, I'm aware that my English is terrible so I'm all ears on better descriptions and better names for these options and data types.

@jgm
Owner

jgm commented Jul 23, 2012

That sounds good, at least for now. We can always change the data type
names later, once the algorithm has been proven to work.

+++ favonia [Jul 23 12 10:12 ]:

Cool. I think we at least agree that an option to support hard breaks will be useful. As for the interface, how about changing the linebreak state to

data LineBreakConv
    = LineBreakSpace -- ^ Convert newline characters to spaces
    | LineBreakSmart -- ^ Use some heuristics to handle newline characters in multilingual text
    | LineBreakPreserve -- ^ Preserve all newline characters

where LineBreakSmart refers to the algorithm we were discussing? Hope I can figure out a nice algorithm soon.

BTW, I'm aware that my English is terrible so I'm all ears on better descriptions and better names for these options and data types.



@geyang

geyang commented Feb 22, 2017

Finally, if your text is mixed, with both Chinese and English, say, then
you can just avoid using hard line breaks in the English text.

or avoid using soft line breaks in Chinese text. This is the solution I'm going with atm.

@jgm
Owner

jgm commented Feb 22, 2017

I believe this can be closed, now that we have the ignore_line_breaks and east_asian_line_breaks extensions.

@jiucenglou

I believe this can be closed, now that we have the ignore_line_breaks and east_asian_line_breaks extensions.

I have markdown files that actually use spaces to separate groups of Chinese words. Presumably, the spaces were used to make the content easier to understand. Now I would like to convert these markdown files to Word and have the spaces between Chinese characters ignored in the conversion. Could you comment on whether pandoc can handle this case, via something such as east_asian_word_breaks?
