Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tidy indent+wrap breaks <pre> formatting #697

Closed
geoffmcl opened this issue Mar 17, 2018 · 6 comments
Closed

tidy indent+wrap breaks <pre> formatting #697

geoffmcl opened this issue Mar 17, 2018 · 6 comments

Comments

@geoffmcl
Copy link
Contributor

Issue from https://lists.w3.org/Archives/Public/html-tidy/2018JanMar/0016.html

tidy -indent -wrap breaks <pre> formatting if the line is wrapped

<pre>a long line
</pre>

when indented and wrapped ends up like this -

      <pre>
      a long line
</pre>

The pre-formatted text shouldn't have leading spaces added. It should look like

      <pre>
a long line
</pre>

example: using tidy --version HTML Tidy for Cygwin version 5.6.0

run $ tidy -indent -wrap 78 --wrap-attributes yes --tidy-mark no

Input:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
          "http://www.w3.org/TR/html4/loose.dtd">
<html><head><title>simple test case</title></head><body>
<div><div><div>
<PRE class="SCREEN">x1234567890123456789012345678901234567890123456789
</PRE >
</div></div></div>
</body></html>

Output:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <title>simple test case</title>
</head>
<body>
  <div>
    <div>
      <div>
        <pre class="SCREEN">
        x1234567890123456789012345678901234567890123456789
</pre>
      </div>
    </div>
  </div>
</body>
</html>

Regards,
Lee

@geoffmcl geoffmcl added the Bug label Mar 17, 2018
@geoffmcl geoffmcl added this to the 5.7 milestone Mar 17, 2018
@geoffmcl
Copy link
Contributor Author

Strange, I though this would be easy... just prevent the indent spaces to be ouput before the x...

But it turns out quite involved... You can have a really long single line of text in a <pre> and tidy knows to not allow it to be wrapped, but when it actually comes to output the line, it will add the indent spaces before...

The problem happens in WrapLine( doc ), which calls WantIndent(doc), and if that returns yes then tidy will ouput the indent spaces before the x... which in this <pre> case, we do not want...

But to back up a little. We are in PPrintTree and we eventually reach the <pre> node, which is handled in the else /* some kind of container element */ part of the code... and enters the following code block (some newlines, spaces deleted for brevity, and some extra comment added) -

    if ( node->tag && (node->tag->parser == TY_(ParsePre) || nodeIsTEXTAREA(node)) ) {
        Bool classic  = TidyClassicVS; /* #228 - cfgBool( doc, TidyVertSpace ); */
        uint indprev = indent;
        PCondFlushLineSmart( doc, indent ); /* about to add <pre> tag - clear any previous */
        /* insert extra newline for classic formatting */
        if (classic && node->parent && node->parent->content != node)
            TY_(PFlushLineSmart)( doc, indent );
        PPrintTag( doc, mode, indent, node );   /* add <pre> or <textarea> tag */
        indent = 0; // NOTE: indent is zeroed here
        /* @camoy Fix #158 - remove inserted newlines in pre - TY_(PFlushLineSmart)( doc, indent ); */
        for ( content = node->content; content; content = content->next )
            TY_(PPrintTree)( doc, (mode | PREFORMATTED | NOWRAP), indent, content ); // NOTE: PREFORMATTED and NOWRAP bits added
        /* @camoy Fix #158 - remove inserted newlines in pre - PCondFlushLineSmart( doc, indent ); */
        indent = indprev;
        PPrintEndTag( doc, mode, indent, node );
        if ( cfgAutoBool(doc, TidyIndentContent) == TidyNoState && node->next != NULL )
            TY_(PFlushLineSmart)( doc, indent );
    }

Now note a few important things. After the <pre...attr> is added to the print buffer, any indent is zeroed, and when printing the contents PREFORMATTED and NOWRAP bits are added. So we should know this throughout the printing of a <pre> and <textarea> elements, and their text children.

You would think the PREFORMATTED bit would also indicate that tidy should not add its own indent, which was also deliberately set zero before start this output... sort of two reason why to not add an indent... but somehow this is forgotten...

For sure, no matter how long the <pre> text is, here I am talking about many words with spaces, tidy does remember NOWRAP...

So as indicated this starts with WantIndent( doc ) return yes, and its code is -

    TidyPrintImpl* pprint = &doc->pprint;
    Bool wantIt = GetSpaces(pprint) > 0;
    if ( wantIt ) {
        Bool indentAttrs = cfgBool( doc, TidyIndentAttributes );
        wantIt = ( ( !IsWrapInAttrVal(pprint) || indentAttrs ) && !IsWrapInString(pprint) );
    }
    return wantIt;

Now GetSpaces(pprint) will return pprint->indent[ 0 ].spaces, and in this case will be 8. This can not be zeroed, since we do need to remember where we are after the <pre>...</pre> to continue onto the next element... and likewise we will not be IsWrapInAttrVal and not IsWrapInString so will return yes and the indent will be added and the damage done...

Now after this, we are in PFlushLine, the pprint->indent[ 0 ].spaces = indent; which is zero at the time... So does this bad indent only happen on the first line of a <pre>? YES! Subsequent lines are not indented!!!

Yowee, is this due to not setting the indent zero early enough? See issue #158 above. There used to be a call to PFlushLineSmart after setting the indent to zero, which got removed, but one of the effect of such a flush is to set pprint->indent[ 0 ].spaces to the indent, now zero... Try a small fix -

diff --git a/src/pprint.c b/src/pprint.c
index 57cd7b1..61845ee 100644
--- a/src/pprint.c
+++ b/src/pprint.c
@@ -2135,6 +2135,8 @@ void TY_(PPrintTree)( TidyDocImpl* doc, uint mode, uint indent, Node *node )
             Bool classic  = TidyClassicVS; /* #228 - cfgBool( doc, TidyVertSpace ); */
             uint indprev = indent;
 
+            indent = 0; /* Issue #697 - set this before the flush */
+
             PCondFlushLineSmart( doc, indent ); /* about to add <pre> tag - clear any previous */
 
             /* insert extra newline for classic formatting */
@@ -2145,7 +2147,6 @@ void TY_(PPrintTree)( TidyDocImpl* doc, uint mode, uint indent, Node *node )
 
             PPrintTag( doc, mode, indent, node );   /* add <pre> or <textarea> tag */
 
-            indent = 0;
             /* @camoy Fix #158 - remove inserted newlines in pre - TY_(PFlushLineSmart)( doc, indent ); */
 
             for ( content = node->content; content; content = content->next )

Still testing this seemingly small fix, but as we have learned over the years, sometimes very small changes have profound consequences, especially in tidy... but so far this seems to work...

Still not sure this first x line should be wrapped to a newline, but browsers seem to ignore this leading newline, and at least it is now flushed left...

Now this newline is introduced due to the wrap condition, but as #158 discusses maybe the should be no wrap applied... do not mess with <pre> formatting... and this is reinforced by the fact that the NOWRAP bit is set...

So more thought needed here. Meantime hope someone gets a chance to test this patch, and advise...

And if anyone has some more ideas on this would love to hear from you... thanks...

@geoffmcl
Copy link
Contributor Author

Have pushed this small patch to an issue-697 branch for consideration and discussion...

This patch has several other consequences!

Yes, it does avoid adding any indent to the first text line of the <pre>, which is good...

But it also removes the indent from the previously indented <pre> tag, and <textarea> tags. Now this may be acceptable, but it does mean adjusting about 10 of our regression tests expects output. Again this may be acceptable...

And it does not remove a newline added after the <pre> if the line length is greater than the current wrap value, as discussed in #158...

So at best this is a partial fix, with other consequences... so as mentioned maybe there is a better way to address this, and its consequences...

Look forward to further feedback... thanks...

geoffmcl added a commit that referenced this issue Mar 20, 2018
@geoffmcl
Copy link
Contributor Author

As stated was not happy with the fix in the issue-697 branch... it had too many other consequences...

So went back to the drawing board! Started with next again...

Have have now created an issue-697-1 branch, and this is looking good! See commit e36df3b for the small diff...

Added 4 test files to my test repo... each should be run with the -i indent option. The lines are already greater than the default wrap width of 68...

  1. The original test in_697.html now keeps the long line on the same line as the indented <pre ...> tag. The class attribute is wrapped, but this is not a problem.

  2. Test in_697-1.html has the closing </pre> on the same line, and this is kept in the output. And it has a really really long line of words in a <pre>. . .</pre> and this is kept in the output.

  3. Test in_697-2.html has 3 long lines, and no indent is added to any of them.

  4. Test in_697-3.html has a paragraph <p> before and after the inline <pre> and they are all indented correctly.

And the best news of all, this 697-1 version of tidy breaks nothing in the regression tests... passes 100%...

Will continue testing, and will get around to deleting the first issue-697 branch, and creating a PR for this new issue-697-1 branch...

Would ask others to checkout and test this issue-697-1 branch and report... thanks...

But presently think, hope, this pretty print bug is solved...

@ler762
Copy link
Contributor

ler762 commented Mar 21, 2018

issue-697-1 branch looks good

I did a quick check & haven't seen any cases where it breaks pre formatting.

The bug seems to be solved, so I guess this is a feature request
when indenting, if the <pre> block of text is going to take more than one line, have the tag on one line and the text starting on the next line.

https://www.w3.org/TR/html/grouping-content.html#the-pre-element
NOTE:
In the HTML syntax, a leading newline character immediately following the pre element start tag is stripped.

so

  <pre>line 1
line 2</pre>

is equivalent to

    <pre>
line 1
line 2</pre>

but much more readable (which is why I'm using indent + wrap)

In any case - thanks for fixing this!

@geoffmcl
Copy link
Contributor Author

geoffmcl commented Apr 3, 2018

@ler762 thanks for testing this in the issue-697-1 branch, and reporting...

Have now added PR #708 to merge and hopefully close this sometime soon...

As to your preference to add a newline after the indented <pre>, please note, previously, back in tidy circa 2000, when this was being done, it also removed the indent from the <pre>, as in -

  <body>
    <div>
      <div>
        <div>
<pre class="SCREEN">
x1234567890123456789012345678901234567890123456789
</pre>
        </div>
      </div>
    </div>
  </body>

Just saying that it may not be possible to get both... but maybe someone can find a way to have both ;=))

Anyway if you feel strongly about this then please consider opening a new separate Feature Request for this... thanks...

geoffmcl added a commit that referenced this issue Apr 22, 2018
Is #697 - Add NOWRAP to print of pre tag - PR #708
@geoffmcl
Copy link
Contributor Author

@ler762 have now merged #708, so closing this... thanks...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants