
Handling of newlines when table elements contain elements like: p and div #63

@janis-veinbergs


My goal is to get more human-readable tables. Is that actually a goal of the ReverseMarkdown library, or am I using it wrong?

Consider two HTML inputs that produce different outputs but should produce identical ones:

<html><body><table><tbody><tr><td><p>col1</p></td><td><p>col2</p></td></tr><tr><td><p>data1</p></td><td><p>data2</p></td></tr></tbody></table></body></html>

Out:

| col1<br> | col2<br> |
| --- | --- |
| data1<br> | data2<br> |

<html><body><table><tbody><tr><td><p>
col1</p></td><td><p>col2</p></td></tr><tr><td>
<p>
data1
</p>
</td>
<td><p>data2</p></td></tr></tbody></table></body></html>

Out:

| col1<br> | col2<br> |
| --- | --- |
| <br>data1<br> | data2<br> |

I would expect output in both cases to be this:

| col1 | col2 |
| --- | --- |
| data1 | data2 |

There are two issues.

The p converter always appends a newline at the end, no matter what:

return $"{indentation}{TreatChildren(node).Trim()}{Environment.NewLine}";

A browser doesn't. And Outlook generates tables exactly like that: td > p > #text.

An incomplete fix would be to check whether p (or perhaps any flow content) is the last element within td and, if so, not add the trailing newline.

So considering some scenarios:

  1. <td><p>data1</p></td>: the browser renders no newlines. ReverseMarkdown emits an excess trailing br: | data1<br> |
  2. <td>data1<p>p</p></td>: the browser renders a newline before p. ReverseMarkdown emits an excess trailing br: | data1<br>p<br> |
  3. <td><div><p>data1</p></div></td>: the browser renders no newlines. ReverseMarkdown emits excess leading and trailing br: | <br>data1<br> |
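For illustration, the expected results in the three scenarios above can be sketched as a post-processing step on an already converted cell. This is a language-neutral Python sketch, not ReverseMarkdown code: the function name `trim_cell` and the idea of post-processing the cell string (rather than fixing the p converter itself) are assumptions made only for this example.

```python
import re

# One or more <br> tags (with optional surrounding whitespace) at the start of a cell.
_LEADING_BR = re.compile(r"^(?:\s*<br\s*/?>)+\s*", re.IGNORECASE)
# One or more <br> tags (with optional surrounding whitespace) at the end of a cell.
_TRAILING_BR = re.compile(r"\s*(?:<br\s*/?>\s*)+$", re.IGNORECASE)


def trim_cell(cell: str) -> str:
    """Drop <br> tags at the edges of a converted table cell.

    Breaks in the middle of the cell are kept, because a browser
    does render those (scenario 2).
    """
    cell = _LEADING_BR.sub("", cell)
    cell = _TRAILING_BR.sub("", cell)
    return cell.strip()


if __name__ == "__main__":
    for raw in ("data1<br>", "data1<br>p<br>", "<br>data1<br>"):
        print(f"| {trim_cell(raw)} |")
```

Trimming only at the cell edges keeps the sketch conservative: it removes the breaks a browser would not show, and leaves any break between two pieces of content untouched.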

I don't know what the best way to handle these newlines would be, but when I convert real-life HTML that comes from Outlook, the output is overwhelmed with newlines.

There are many cases. Fixing the ones I brought up is possible. Should I do it?

A line break after a start tag and before an end tag should be ignored

This comes from a non-standard document, but browsers currently behave this way:

A line break occurring immediately following a start tag must be ignored, as must a line break occurring immediately before an end tag. This applies to all HTML elements without exceptions. In addition, for all elements except PRE, leading white space characters, such as spaces, horizontal tabs, form feeds and line breaks, following the start tag must be ignored, and any subsequent sequence of contiguous white space characters must be replaced by a single word space.
The following three examples must be rendered identically:

<P>Thomas is watching TV.</P>
<P>
Thomas is watching TV.
</P>
<P>
   Thomas is watching TV.
</P>

w3.org HTML Text element
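As a rough illustration of the quoted rule, the whitespace handling could look like the following. This is a Python sketch under assumptions: the function name `normalize_element_text` and the `preformatted` flag are hypothetical, and the input is taken to be the raw text between an element's start and end tags.

```python
import re


def normalize_element_text(text: str, preformatted: bool = False) -> str:
    """Apply the quoted whitespace rule to the text inside one element."""
    # A line break right after the start tag or right before the end tag
    # is ignored for every element, PRE included.
    text = re.sub(r"\A\r?\n", "", text)
    text = re.sub(r"\r?\n\Z", "", text)
    if preformatted:
        return text
    # For everything except PRE: drop leading white space, then collapse
    # each remaining run of white space into a single space.
    text = text.lstrip(" \t\f\r\n")
    return re.sub(r"[ \t\f\r\n]+", " ", text)


if __name__ == "__main__":
    samples = [
        "Thomas is watching TV.",
        "\nThomas is watching TV.\n",
        "\n   Thomas is watching TV.\n",
    ]
    for sample in samples:
        print(repr(normalize_element_text(sample)))
```

With this normalization, the three P examples from the quote reduce to the same string, which is the behavior the two HTML inputs at the top of this issue would need in order to convert identically.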
