Do you have crappy HTML? I do!
<table width="100%" border="0" cellspacing="0" cellpadding="0">
<tr>
<td height="31"><b>Currently we have these articles available:</b>
<blockquote>
<p><a href="foo.html">The History of Foo</a><br />
An <span color="red">informative</span> piece of <font face="arial">information</font>.</p>
<p><A HREF="bar.html">A Horse Walked Into a Bar</A><br/> The bartender said
"Why the long face?"</p>
</blockquote>
</td>
</tr>
</table>
Just look at those blank lines and random line breaks, trailing spaces, mixed tabs, deprecated tags - it's outrageous!
Let's clean it up:
var cleaner = require('clean-html'),
    fs = require('fs'),
    filename = process.argv[2];

// Read the file as a UTF-8 string and print the cleaned markup
fs.readFile(filename, 'utf8', function (err, data) {
    if (err) {
        throw err;
    }
    cleaner.clean(data, function (html) {
        console.log(html);
    });
});
Running this script on the file above produces the following output:
<table>
  <tr>
    <td>
      <b>Currently we have these articles available:</b>
      <blockquote>
        <p>
          <a href="foo.html">The History of Foo</a>
          <br>
          An <span>informative</span> piece of information.
        </p>
        <p>
          <a href="bar.html">A Horse Walked Into a Bar</a>
          <br>
          The bartender said "Why the long face?"
        </p>
      </blockquote>
    </td>
  </tr>
</table>
You can pass additional options to the clean function like this:
var options = {
    'add-remove-tags': ['table', 'tr', 'td', 'blockquote']
};

cleaner.clean(data, options, function (html) {
    console.log(html);
});
In this case, it produces:
<b>Currently we have these articles available:</b>
<p>
  <a href="foo.html">The History of Foo</a>
  <br>
  An <span>informative</span> piece of information.
</p>
<p>
  <a href="bar.html">A Horse Walked Into a Bar</a>
  <br>
  The bartender said "Why the long face?"
</p>
Sanity restored!
break-around-comments
Adds line breaks before and after comments.
Type: Boolean
Default: true
break-around-tags
Tags that should have line breaks added before and after.
Type: Array
Default: ['body', 'blockquote', 'br', 'div', 'h1', 'h2', 'h3', 'h4', 'h5', 'h6', 'head', 'hr', 'link', 'meta', 'p', 'table', 'title', 'td', 'tr']
indent
The string to use for indentation, e.g., a tab character or one or more spaces.
Type: String
Default: '  ' (two spaces)
remove-attributes
Attributes to remove from markup.
Type: Array
Default: ['align', 'bgcolor', 'border', 'cellpadding', 'cellspacing', 'color', 'height', 'target', 'valign', 'width']
remove-comments
Removes comments.
Type: Boolean
Default: false
remove-empty-tags
Tags to remove from markup if empty.
Type: Array
Default: []
remove-tags
Tags to always remove from markup. Nested content is preserved.
Type: Array
Default: ['center', 'font']
replace-nbsp
Replaces non-breaking white space entities (&nbsp;) with regular spaces.
Type: Boolean
Default: false
wrap
The column number where lines should wrap. Set to 0 to disable line wrapping.
Type: Integer
Default: 120
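As a rough illustration of combining several options in one call, here is a minimal sketch. The helper name cleanStrict is hypothetical and not part of clean-html, and the choice of 'style' as an extra attribute to strip is just an example. The cleaner is passed in as an argument so the helper can be tried with a stand-in object as well as the real module.

```javascript
// Hypothetical helper, not part of clean-html. `cleaner` is assumed to be
// the object returned by require('clean-html').
function cleanStrict(cleaner, html, callback) {
    var options = {
        // Strip comments (documented default: false)
        'remove-comments': true,
        // Example value only: also strip inline style attributes
        'add-remove-attributes': ['style']
    };

    // Same call shape as the earlier example: clean(html, options, callback)
    cleaner.clean(html, options, callback);
}

// Usage, assuming clean-html is installed:
// cleanStrict(require('clean-html'), '<p style="color:red">Hi</p>', console.log);
```

Any option left out of the object should keep its documented default.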
These options exist for your convenience. Each one adds to the corresponding list option.
add-break-around-tags
Additional tags to include in break-around-tags.
Type: Array
Default: null
add-remove-attributes
Additional attributes to include in remove-attributes.
Type: Array
Default: null
add-remove-tags
Additional tags to include in remove-tags.
Type: Array
Default: null
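One way to picture how an "add-" option relates to its base option is the sketch below. It is an illustration only, not the library's internal code: the function effectiveRemoveTags is hypothetical, and it assumes that an explicit remove-tags value replaces the default list while add-remove-tags appends to whichever list is in effect.

```javascript
// Illustrative only: NOT clean-html's internal implementation.
// Default taken from the remove-tags entry in this document.
var DEFAULT_REMOVE_TAGS = ['center', 'font'];

// Hypothetical helper: computes the list of tags that would be removed.
function effectiveRemoveTags(options) {
    options = options || {};
    // Assumption: an explicit 'remove-tags' replaces the default list
    var base = options['remove-tags'] || DEFAULT_REMOVE_TAGS;
    // 'add-remove-tags' adds to the list instead of replacing it
    var extra = options['add-remove-tags'] || [];
    return base.concat(extra);
}

console.log(effectiveRemoveTags({'add-remove-tags': ['table', 'tr']}).join(', '));
// prints "center, font, table, tr"
```

Read this way, the "add-" options save you from copying the whole default list just to append one or two entries.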
If this package is installed globally, it can be used from the command line:
$ cat crappy.html | clean-html
Instead of piping the input from another program, you can supply a filename as the first argument:
$ clean-html crappy.html
You can redirect the output to another file:
$ clean-html crappy.html > clean.html
Or you can edit the file in place:
$ clean-html crappy.html --in-place
All of the options above can be used from the command line. Array option values should be separated by commas:
$ clean-html crappy.html --add-remove-tags b,i,u
Boolean options can be set to true like this:
$ clean-html crappy.html --remove-comments
Or like this:
$ clean-html crappy.html --remove-comments true
They can be set to false like this:
$ clean-html crappy.html --remove-comments false
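The command-line conventions above (comma-separated arrays, bare flags meaning true, explicit true or false values) can be sketched as a small parser. This is a hypothetical illustration of the conventions, not the actual clean-html CLI code: parseCliOptions and ARRAY_OPTIONS are made-up names, the array list only covers array options named in this document, and the sketch assumes the filename comes before any flags, as in the examples.

```javascript
// Hypothetical sketch; the real CLI may parse its arguments differently.
// Array-valued options named in this document (possibly incomplete).
var ARRAY_OPTIONS = [
    'break-around-tags', 'remove-attributes', 'remove-tags',
    'add-break-around-tags', 'add-remove-attributes', 'add-remove-tags'
];

function parseCliOptions(args) {
    var options = {};
    for (var i = 0; i < args.length; i++) {
        var arg = args[i];
        if (arg.indexOf('--') !== 0) {
            continue; // not a flag, e.g. the filename
        }
        var name = arg.slice(2);
        var next = args[i + 1];
        var hasValue = next !== undefined && next.indexOf('--') !== 0;
        if (!hasValue) {
            options[name] = true; // bare flag, e.g. --remove-comments
            continue;
        }
        i++; // consume the value
        if (ARRAY_OPTIONS.indexOf(name) !== -1) {
            options[name] = next.split(','); // b,i,u -> ['b', 'i', 'u']
        } else if (next === 'true' || next === 'false') {
            options[name] = (next === 'true');
        } else {
            options[name] = next;
        }
    }
    return options;
}
```

For example, parseCliOptions(['crappy.html', '--add-remove-tags', 'b,i,u']) yields an object whose 'add-remove-tags' property is the array ['b', 'i', 'u'].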