What is the true purpose and use case of the --bare option? #896

Closed
JustAGuyCoding opened this issue Sep 27, 2020 · 5 comments

@JustAGuyCoding

What is the true purpose of the --bare option? Tidy's help says it strips out smart quotes, em dashes, and the like. More information under help-option indicates it's for cleaning up MSWord documents.

These two descriptions of the option don't seem the same:

tidy --help
-bare, -b strip out smart quotes and em dashes, etc.

tidy -help-option bare
This option specifies if Tidy should strip Microsoft specific HTML from Word
2000 documents, and output spaces rather than non-breaking spaces where they
exist in the input.

After briefly reading the documentation at https://api.html-tidy.org/tidy/quickref_5.6.0.html I thought bare could be used to clean up MSWord documents but was surprised when it substituted hyphens for em-dashes.

Side note: There is also a word-2000 option, which is for cleaning MSWord documents. It seems tailored for MSWord "Web Page" exports, as opposed to MSWord "Web Page, Filtered" exports.

Related issues:
#885

@JustAGuyCoding
Author

Further reading:

@geoffmcl
Contributor

@JustAGuyCoding wow, thanks for the Further reading: link... that seems to confirm my interpretation, from the code...

Take the From: Jelks Cabaniss question, and reply From: Lee Passey, on 2002-02-21 -

I'm wondering if overloading --bare to strip redundant spaces and
"convert down" other characters is a good idea. Wouldn't it
be better to separate that functionality into two different
options?

The downgrading of characters is not overloading the --bare option;
that is what it does now, and it is the only thing it does. In other
words, we already have two different options: --word-2000 to
attempt to clean up M$ mess, and --bare to convert some recent named
entities which are not yet widely supported.

So, as suggested in #885, I think the following patches need to be implemented to completely separate the functionality of bare from word-2000... as the code implies...

diff --git a/include/tidyenum.h b/include/tidyenum.h
index 3daee5b..e3fa793 100644
--- a/include/tidyenum.h
+++ b/include/tidyenum.h
@@ -610,7 +610,7 @@ typedef enum
     TidyLiteralAttribs,          /**< If true attributes may use newlines */
     TidyLogicalEmphasis,         /**< Replace i by em and b by strong */
     TidyLowerLiterals,           /**< Folds known attribute values to lower case */
-    TidyMakeBare,                /**< Make bare HTML: remove Microsoft cruft */
+    TidyMakeBare,                /**< Replace smart quotes, em dashes, etc with ASCII. */
     TidyMakeClean,               /**< Replace presentational clutter by style rules */
     TidyMark,                    /**< Add meta element indicating tidied doc */
     TidyMergeDivs,               /**< Merge multiple DIVs */
diff --git a/src/clean.c b/src/clean.c
index e96dd3f..059e9da 100644
--- a/src/clean.c
+++ b/src/clean.c
@@ -1890,8 +1890,7 @@ void TY_(CleanWord2000)( TidyDocImpl* doc, Node *node)
         if ( nodeIsHTML(node) )
         {
             /* check that it's a Word 2000 document */
-            if ( !TY_(GetAttrByName)(node, "xmlns:o") &&
-                 !cfgBool(doc, TidyMakeBare) )
+            if ( !TY_(IsWord2000) (doc) )
                 return;
 
             /* Output proprietary attributes to maintain errout compatability
diff --git a/src/language_en.h b/src/language_en.h
index 60bde02..eab5567 100644
--- a/src/language_en.h
+++ b/src/language_en.h
@@ -786,9 +786,9 @@ static languageDefinition language_en = { whichPluralForm_en, {
       - The strings "Tidy" and "HTML Tidy" are the program name and must not
       be translated. */
         TidyMakeBare,                 0,
-        "This option specifies if Tidy should strip Microsoft specific HTML "
-        "from Word 2000 documents, and output spaces rather than non-breaking "
-        "spaces where they exist in the input. "
+        "This option specifies if Tidy should replace smart quotes and em dashes with "
+        "ASCII, and output spaces rather than non-breaking "
+        "spaces, where they exist in the input. "
     },
     {/* Important notes for translators:
       - Use only <code></code>, <var></var>, <em></em>, <strong></strong>, and

This seems to clear up the issue of when bare should be used, its meaning and purpose... and makes the docs match the code...

Simply put, bare can be applied to any HTML that includes "recent named entities which are not yet widely supported", although I would question whether that quote from 2002 still applies in 2020... I, being an oldie, do remember editors which would choke on such characters, and would have been thankful for their conversion to pure ASCII... but nothing like that on my Windows 10 or Linux machines these days...

I am perfectly happy to leave them as UTF-8, every time, everywhere... smart quotes being E2 80 9C and E2 80 9D, and the em dash E2 80 94
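
For reference, a minimal Python sketch (not Tidy's actual code) that confirms the UTF-8 byte sequences quoted above and illustrates the kind of ASCII downgrade that bare is described as performing. The `ASCII_DOWNGRADE` table and the `utf8_hex` and `downgrade_typography` helpers are hypothetical names used only for illustration; the exact set of characters Tidy converts is an assumption here and should be checked against src/clean.c.

```python
# Illustrative sketch only; this is not Tidy's implementation.

LEFT_DOUBLE_QUOTE = "\u201c"   # UTF-8: E2 80 9C
RIGHT_DOUBLE_QUOTE = "\u201d"  # UTF-8: E2 80 9D
EM_DASH = "\u2014"             # UTF-8: E2 80 94
NBSP = "\u00a0"                # non-breaking space

# Hypothetical mapping, assumed from the option description
# ("replace smart quotes and em dashes with ASCII", spaces for
# non-breaking spaces). Tidy's real table may cover more characters.
ASCII_DOWNGRADE = {
    ord(LEFT_DOUBLE_QUOTE): '"',
    ord(RIGHT_DOUBLE_QUOTE): '"',
    ord(EM_DASH): "-",
    ord(NBSP): " ",
}


def utf8_hex(ch: str) -> str:
    """Return the UTF-8 bytes of ch as space-separated upper-case hex."""
    return " ".join(f"{b:02X}" for b in ch.encode("utf-8"))


def downgrade_typography(text: str) -> str:
    """Replace the mapped typographic characters with ASCII equivalents."""
    return text.translate(ASCII_DOWNGRADE)


if __name__ == "__main__":
    for ch in (LEFT_DOUBLE_QUOTE, RIGHT_DOUBLE_QUOTE, EM_DASH):
        print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
    print(downgrade_typography("\u201cbare\u201d\u00a0\u2014 plain"))
```

The mapping is kept as a plain dictionary so it is easy to compare against whatever list of characters the real code handles.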

And these patches allow the word-2000 option to do its best clean-up of both unfiltered and filtered MS Word exports... maybe there is a case for even more filtering, but that should be a new issue...
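
To make the effect of the src/clean.c change concrete, here is a small Python model of the two guards shown in the diff above. It is only a truth-table sketch: `old_guard_proceeds` and `new_guard_proceeds` are hypothetical names, and the result of `IsWord2000` is treated as an opaque boolean input rather than re-implemented.

```python
# Truth-table sketch of the early-return guard in CleanWord2000.
# Not Tidy code; the function names are made up for illustration.


def old_guard_proceeds(has_xmlns_o: bool, make_bare: bool) -> bool:
    """Old check: return early only if xmlns:o is absent AND bare is off."""
    return not (not has_xmlns_o and not make_bare)


def new_guard_proceeds(is_word_2000: bool) -> bool:
    """New check: return early unless the document is detected as Word 2000."""
    return is_word_2000


if __name__ == "__main__":
    # A non-Word document (no xmlns:o) processed with bare switched on:
    print("old guard proceeds:", old_guard_proceeds(False, True))
    # The patched guard no longer looks at bare at all:
    print("new guard proceeds:", new_guard_proceeds(False))
```

Under the old condition, turning bare on was enough to let the Word clean-up run on a document with no xmlns:o attribute, which is the coupling between the two options that the patch removes.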

I still waver on whether the 'IsWord2000' filter should even be left there at all... but can leave that for another time...

More information under help-option indicates it's for cleaning up MSWord documents.

That is corrected, in one of the patches, but there could be alternate wording, even in other places not addressed...

I will try to find the time to add a PR for this issue... unless someone beats me to it...

Look forward to further feedback, comments, even alternate patches, other code, etc, to help in finalising this into a PR... thanks...

@geoffmcl
Contributor

geoffmcl commented Oct 4, 2020

@JustAGuyCoding have now created the issue-896 branch, so you can now do -

$ cd tidy-html5
$ git pull
$ git checkout issue-896
$ # build tidy as you normally would...

To the cmake config step I usually add something like -DTIDY_RC_NUMBER=I896, to help identify the EXE...

Then you can view, review, test, etc, the changes...

Have I missed any references, in the docs or in the code, that perpetuate any direct tie between the bare and word-2000 options? Please advise... thanks...

Will shortly get around to setting up the Pull Request...

@geoffmcl
Contributor

geoffmcl commented Oct 5, 2020

@JustAGuyCoding, have now created the PR #898 ...

Look forward to further feedback, comments, even alternate patches, other code, etc, to help complete the PR... thanks...

geoffmcl added a commit that referenced this issue Nov 21, 2020
* Is. #896 - make 'bear' docs match code

* Is. #487 #462 add warn msg and do not get stuck until eof

The warning message could perhaps be better worded, and maybe there
should be another msg when a '>' is encountered while looking for a ']'
in a MS Word section, and perhaps the section should be discarded...

And perhaps it should be an error, to force the user to fix...

But the fix is good as it is, and these issues can be dealt with
later...

And this fix is piggy backed on this PR, but it is likewise related to
'word-2000' option...
@geoffmcl
Contributor

@JustAGuyCoding have now merged #898, fixing the docs to conform to the actual code, so closing this... thanks...
