Handle newlines inside strong/emphasis tags #123

a-warner · 2015-11-10T16:03:23Z

Previously the following markup:

<em>Hello <br><br> world</em>

would get converted into the markdown:

_hello

world_

which is incorrect. Both lines should be surrounded with underscores. This pull request splits em, i,strong, andb` tags into lines and surrounds each line with the appropriate markdown

domchristie · 2015-11-10T22:47:36Z

Thanks for this!

It’s an interesting case because I don’t think it’s possible to produce Hello world from Markdown, so it raises questions as to what the correct outcome is?! :/

Your proposed outcome is:

_Hello_
_world_

which converts back to HTML as:

<p><em>Hello</em>
<em>world</em></p>

which renders as:

Hello world

To make this a bit more robust, perhaps toMarkdown should place any number of newlines with a space in any inline element before converting it?

a-warner · 2015-11-11T01:42:10Z

yea actually my proposed outcome for Hello world converted to markdown is actually:

_Hello_

_world_

which, translated back to html, is:

HelloWorld

which I think is right, but that's not what the old version of my code actually did. The new version I think handles this case properly.

Thoughts?

domchristie · 2015-11-11T20:55:07Z

Thanks! I think the result of your most recent version is more inline with the outcome of the rendered HTML, however I am not entirely convinced that a converter is the correct place to handle this case.

Converters should be as simple as possible, and should contain as little logic as possible. This is unavoidable in some cases (e.g. ul/ol), but it feels like the em/strong converter is doing too much.

In this case I’m favouring stripping multiple newlines from inline elements before they are converted (i.e. somewhere in the process function). It’d mean that some accuracy may be lost in this case, but it keeps the complexity down to a minimum.

Hope that makes sense?!

a-warner · 2015-11-11T22:53:35Z

In this case I’m favouring stripping multiple newlines from inline elements before they are converted (i.e. somewhere in the process function). It’d mean that some accuracy may be lost in this case, but it keeps the complexity down to a minimum.

So you're saying that:

foo bar should become _foo bar_? (i.e. with no newlines in there at all?)

domchristie · 2015-11-12T09:29:14Z

Perhaps

<em>foo<br><br>bar</em>

should convert to

_foo  
bar_

which would then result in the following when converted back to HTML:

<p><em>foo <br />
bar</em></p>

This could be achieved by doing something like the following in the process function:

// Remove line-breaks caused by multiple <br>
if(isInline(node)) content = content.replace(/(  \n){2,}/g, '  \n')

Obviously it’s not perfect, but considering that it’s not possible to create the original HTML from markdown, and that it fixes the current issue (of the content not being wrapped correctly), as well as maintaining the simplicity of the converters, I feel it’s a reasonably good solution.

Has this HTML been generated by contentEdtitable by any chance? If so, you may want to check out https://github.com/guardian/scribe which normalises some of the contentEditable weirdness.

a-warner · 2015-11-12T16:41:05Z

@domchristie as backstory, I came across this issue while trying to convert HTML generated by trix editor to markdown. In trix, if you turn on italics and hit return twice, it'll generate this markup. It's perfectly valid html, which I agree cannot reasonably be generated by markdown.

Is the goal of this library to only convert HTML that could be generated by markdown into markdown? If so then I think I agree your solution is the best. If the goal is to do a best-effort two-way conversion between the two then I think collapsing multiple newlines into one newline isn't great.

Either way I agree that the processor should probably handle this instead of the converter, good call! What if I change the processor to call inline converters once per line? Another option would be to convert   to a paragraph break, though that might be more difficult.

domchristie · 2015-11-12T18:46:18Z

Is the goal of this library to only convert HTML that could be generated by markdown into markdown? If so then I think I agree your solution is the best. If the goal is to do a best-effort two-way conversion between the two then I think collapsing multiple newlines into one newline isn't great.

Yeh, it’s an interesting point. The goal of this library isn’t to only convert HTML that has been generated from Markdown, but to take a best-effort approach. However that’s not to say that it’s responsibility is to handle all cases including invalid or poorly authored HTML.

About , the spec says:

br elements must be used only for line breaks that are actually part of the content, as in poems or addresses.

So to have two successive  s seems rather odd, even though it is perfectly valid.

In this case of Hello world, knowing it comes from a WYSIWYG, the intention is probably to have two paragraphs: one containing Hello and the other with world, but this may not be true in all cases. If it is the case then it should probably be fixed before passing it in to to-markdown, since it’d be impossible for to-markdown to determine the intent of every sequence of markup of this kind.

The options:

Original

<p>
  <em>Hello <br><br> world</em>
</p>

Option 1.

More than one line-break is squashed to just one. Results in:

_Hello  
world_

Converted back to HTML:

<p>
  <em>Hello <br>world</em>
</p>

Option 2.

Two paragraphs are created:

_Hello_

_world_

<p>
  <em>Hello</em>
</p>

<p>
  <em>world</em>
</p>

To me they’re both incorrect! Option 1 maintains more of the original markup (since the  s may be deliberate) but is still lossy, whereas Option 2 probably spaces the text appropriately but loses the  s and adds another paragraph which alters the meaning.

So in cases like this, I’d opt for the simplest, most maintainable solution.

As I mentioned above, I don’t think it’s the responsibility of the library to fix/alter the HTML structure, but the idea of converting line-by-line sounds interesting!

Aside: Interestingly The Guardian’s Scribe creates a new paragraph on return. It’s possible to create line-breaks but harder to do so (shift+return) http://guardian.github.io/scribe/

domchristie · 2015-11-13T10:17:31Z

I have contributed some thoughts to the issue on trix here: basecamp/trix#75 (comment)

dylanjmcdonald · 2020-02-25T22:49:03Z

Any news on if this will be implemented?

I have a similar issue where I'm converting html from emails to markdown. I don't have control over the markup of the emails coming through.

This markup

<div dir="ltr"><b>Here&#39;s a link:<br><br></b><a href="https://www.google.com/">https://www.google.com/</a><div><br></div><div><br></div></div>

gets converted to

**Here's a link:    **[https://www.google.com/](https://www.google.com/)

Handle newlines inside strong/emphasis tags

94e4eb7

a-warner force-pushed the master branch from 77da00f to 94e4eb7 Compare November 11, 2015 01:32

domchristie mentioned this pull request Aug 26, 2018

Line breaks inside format tags #253

Open

domchristie mentioned this pull request Jan 21, 2019

Merging repeated elements without any non-whitespace content between them #270

Open

MattiSG mentioned this pull request May 3, 2023

Markdown bold with newline does not display correctly OpenTermsArchive/contribution-tool#144

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Handle newlines inside strong/emphasis tags #123

Handle newlines inside strong/emphasis tags #123

a-warner commented Nov 10, 2015

domchristie commented Nov 10, 2015

a-warner commented Nov 11, 2015

domchristie commented Nov 11, 2015

a-warner commented Nov 11, 2015

domchristie commented Nov 12, 2015

a-warner commented Nov 12, 2015

domchristie commented Nov 12, 2015

domchristie commented Nov 13, 2015

dylanjmcdonald commented Feb 25, 2020

Handle newlines inside strong/emphasis tags #123

Are you sure you want to change the base?

Handle newlines inside strong/emphasis tags #123

Conversation

a-warner commented Nov 10, 2015

domchristie commented Nov 10, 2015

a-warner commented Nov 11, 2015

domchristie commented Nov 11, 2015

a-warner commented Nov 11, 2015

domchristie commented Nov 12, 2015

a-warner commented Nov 12, 2015

domchristie commented Nov 12, 2015

Original

Option 1.

Option 2.

domchristie commented Nov 13, 2015

dylanjmcdonald commented Feb 25, 2020