Linebreak handling control #534
Also update README to document new features.
Given that a text may contain parts in Chinese/Japanese and parts in English, I wonder whether a switch that affects the whole document like this makes the most sense. (Though certainly others have asked for a switch that makes linebreaks into hard breaks.) Another possibility would be this: if the next nonspace character after the newline is a Chinese/Japanese character, then the newline is automatically skipped. What do you think? (Ideally we'd also check the character before, but this would require changes to the parser that I think would affect performance.)
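The heuristic proposed above could be sketched roughly as follows. This is a minimal illustration, not pandoc's actual code; the function names and the (deliberately incomplete) character ranges are my own assumptions.

```haskell
import Data.Char (ord)

-- Sketch of the proposed heuristic: a soft line break is dropped when
-- the character that follows it is Chinese/Japanese, and converted to
-- a space otherwise.  The ranges below cover only CJK Unified
-- Ideographs, Hiragana, and Katakana, as an approximation.
isCJK :: Char -> Bool
isCJK c = (n >= 0x4E00 && n <= 0x9FFF)  -- CJK Unified Ideographs
       || (n >= 0x3040 && n <= 0x30FF)  -- Hiragana and Katakana
  where n = ord c

softBreaksToSpaces :: String -> String
softBreaksToSpaces ('\n' : c : rest)
  | isCJK c   = softBreaksToSpaces (c : rest)        -- skip the newline
  | otherwise = ' ' : softBreaksToSpaces (c : rest)  -- newline -> space
softBreaksToSpaces ['\n']     = ""
softBreaksToSpaces (c : rest) = c : softBreaksToSpaces rest
softBreaksToSpaces []         = []
```

For example, `softBreaksToSpaces "hello\nworld"` yields `"hello world"`, while `softBreaksToSpaces "你\n好"` yields `"你好"`. Checking the character *before* the newline as well would need lookbehind state in the parser, which is the performance concern mentioned above.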
I share your concern about multilingual documents; I have already inserted lots of spaces in my posts mixing English and Chinese. However, I'm not totally convinced by your solution. It's quite common in my posts for a Chinese character to be followed by an English word, and the parser would also become a little difficult to explain. I have a couple of random ideas. First, I am thinking of a toggle like GCC's `#pragma pack`, with which you can push and pop linebreak configurations. Second, it seems we are building an inverse of the Unicode line breaking algorithm: interpret X as Y iff Y is the string with the fewest line breaks such that the Unicode algorithm would format Y as X. What would you think? I am still studying the Unicode standard, though.
+++ favonia [Jul 22 12 20:03 ]:
That sounds too complicated and would likely be slow. Maybe the thing to do is just to add a command-line flag that makes all […]

It seems to me that this would be good enough for writers who write […] It would also be useful for some other purposes.

Finally, if your text is mixed, with both Chinese and English, say, then […]

Thoughts?
The reason to support hand-wrapped text paragraphs is that it looks better; that's actually one of the principles of Markdown. What I am doing here is to extend this principle to other writing systems, since default Markdown is biased toward Western writing systems. I think your proposal would make sense if Markdown hadn't supported hand-wrapped text paragraphs from the very beginning, i.e., if all line breaks were kept.

BTW, I think the stack-like option is not difficult to implement; this line

[…]

can be changed to

[…]

and we only look at the head. I guess this won't hurt performance too much (compared to my current hack). The upside of this approach is that we can change the linebreak setting on the fly and paragraphs remain composable. The downside is that we need to extend the syntax carefully, because an explicit option might break another principle of Markdown (being publishable as-is). For example, the following option works but clearly indicates that the document is marked up:

```
%% linebreak push raw
Words
with
newlines
%% linebreak pop
```

A simpler alternative is to give up composability and have

```
%% linebreak raw
Words
with
newlines
%% linebreak space
```

instead. In this case […]

For the Unicode algorithm, I guess only the document is overly complicated; I also think an approximation is good enough. I like your idea of ignoring the space automatically, but I might want something smarter. In sum, I think these are good approaches: […]

Personally, I would like to test the smart converter. I understand your concern about speed, but I would like to try it out. How should I test the parsing speed? What's the standard benchmark I can use to measure the impact on speed?
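The push/pop semantics described above amount to keeping a stack of modes and consulting only its head. A minimal sketch, with names invented for illustration (none of this is in the actual patch):

```haskell
-- Sketch of the stack-like setting discussed above:
-- "%% linebreak push raw" pushes a mode, "%% linebreak pop" restores
-- the previous one, and the parser consults only the head.
data LineBreakMode = ModeSpace | ModeRaw deriving (Eq, Show)

type ModeStack = [LineBreakMode]

pushMode :: LineBreakMode -> ModeStack -> ModeStack
pushMode = (:)

popMode :: ModeStack -> ModeStack
popMode (_ : rest@(_ : _)) = rest   -- discard the head
popMode stack              = stack  -- never pop the default at the bottom

currentMode :: ModeStack -> LineBreakMode
currentMode (m : _) = m
currentMode []      = ModeSpace     -- fall back to the default
```

Because paragraphs only ever read `currentMode`, a pasted fragment that balances its own pushes and pops leaves the surrounding document's setting untouched, which is the composability property mentioned above.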
+++ favonia [Jul 23 12 06:41 ]:
I don't like the ideas that involve non-textual markup codes. The whole […] But 3 may be worth exploring further.
There's a `Benchmark.hs` in the main directory that you can use.
Cool. I think we at least agree that an option to support hard breaks will be useful. As for the interface, how about changing the data type to

```haskell
data LineBreakConv
  = LineBreakSpace    -- ^ Convert newline characters to spaces
  | LineBreakSmart    -- ^ Use some heuristics to handle newline characters in multilingual text
  | LineBreakPreserve -- ^ Preserve all newline characters
```

where […] BTW, I'm aware that my English is terrible, so I'm all ears for better descriptions and better names for these options and data types.
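One way a writer could dispatch on those three constructors is sketched below. The data type is repeated so the fragment compiles on its own, and `renderSoftBreak` (plus its one-character lookahead and its single CJK range) is an illustrative invention, not part of the actual patch:

```haskell
-- Re-declared here so the sketch is self-contained.
data LineBreakConv
  = LineBreakSpace     -- ^ Convert newline characters to spaces
  | LineBreakSmart     -- ^ Heuristics for multilingual text
  | LineBreakPreserve  -- ^ Preserve all newline characters

-- Hypothetical: what a soft line break becomes, given the option and
-- the character that follows the break.
renderSoftBreak :: LineBreakConv -> Char -> String
renderSoftBreak LineBreakSpace    _ = " "
renderSoftBreak LineBreakPreserve _ = "\n"
renderSoftBreak LineBreakSmart next
  | next >= '\x4E00' && next <= '\x9FFF' = ""   -- drop before CJK text
  | otherwise                            = " "
```

Keeping the choice in a small sum type like this means the option can later grow (say, a fourth constructor) without changing the call sites that merely pattern-match on it.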
That sounds good, at least for now. We can always change the data type […]

+++ favonia [Jul 23 12 10:12 ]:

[…]
[…] or avoid using soft line breaks in Chinese text. This is the solution I'm going with atm.
I believe this can be closed, now that we have the […]
I have markdown files that actually use spaces to separate groups of Chinese words; presumably, the spaces were added to make the content easier to understand. Now I would like to convert the markdown files to Word, ignoring the spaces between Chinese characters during the conversion. Could you comment on whether pandoc can handle this case via something such as `east_asian_word_breaks`?
There are languages (Chinese, Japanese, etc.) that do not use spaces to separate words. This is my hack to add a new option `--linebreak` to change the default newline conversion for `markdown` input. It can be either `space` (the default, original behavior), `skip`, or `preserve` for now. I'd like to hear your opinions on this and work on this patch more, for example by making this option work for other kinds of input. Ideally we should be able to mark a particular region with a particular language, and the conversion should be context-aware; I only implemented the basic control here, and it's valid only for `markdown` input. For related information, see `text-space-collapse` in the current CSS3 draft and some questions on Stack Overflow.