selectors, baseElements, compile/convert
KillyMXI committed Jun 2, 2021
1 parent 411dc3e commit 827a3f2
Showing 17 changed files with 856 additions and 377 deletions.
4 changes: 2 additions & 2 deletions .eslintrc.js
@@ -7,7 +7,7 @@ module.exports = {
'plugin:jsdoc/recommended',
'plugin:mocha/recommended'
],
- parserOptions: {},
+ parserOptions: { ecmaVersion: 2018 },
env: {
es6: true,
node: true,
@@ -100,7 +100,7 @@ module.exports = {
'semi': 'error',
'semi-spacing': 'error',
'semi-style': 'error',
- 'sort-keys': ['error', 'asc', { minKeys: 3 }],
+ 'sort-keys': ['error', 'asc', { minKeys: 4 }],
'space-before-blocks': 'error',
'space-before-function-paren': ['error'],
'space-in-parens': 'error',
48 changes: 48 additions & 0 deletions CHANGELOG.md
@@ -1,5 +1,53 @@
# Changelog

## Version 8.0.0 (WIP)

All commits: [7.1.1...8.0.0](https://github.com/html-to-text/node-html-to-text/compare/7.1.1...8.0.0)

Version 8 roadmap issue: [#228](https://github.com/html-to-text/node-html-to-text/issues/228)

### Selectors

The main focus of this version. Addresses the most demanded user requests ([#159](https://github.com/html-to-text/node-html-to-text/issues/159), [#179](https://github.com/html-to-text/node-html-to-text/issues/179), partially [#143](https://github.com/html-to-text/node-html-to-text/issues/143)).

It is now possible to specify formatting options or assign custom formatters not only by tag name but by almost any selector.

See the README [Selectors](https://github.com/html-to-text/node-html-to-text#selectors) section for details.

Note: The new `selectors` option is an array, in contrast to the `tags` option introduced in version 6 (and now deprecated). Selectors need a well-defined order, and object properties are not the right tool for that.

Two new packages were created to enable this feature - [parseley](https://github.com/mxxii/parseley) and [selderee](https://github.com/mxxii/selderee).

### Base elements

The same selector implementation is now used to narrow the conversion down to specific HTML DOM fragments. Addresses [#96](https://github.com/html-to-text/node-html-to-text/issues/96). (The previous implementation supported a more limited selector format.)

BREAKING CHANGE: All outermost elements matching the provided selectors will be present in the output (previously only the first match for each selector was). Addresses [#215](https://github.com/html-to-text/node-html-to-text/issues/215).

`limits.maxBaseElements` can be used when you only need a fixed number of base elements and would like to avoid checking the rest of the source HTML document.

Base elements can be arranged in the output text either in the order of matched selectors (the default, to stay closer to the old implementation) or in the order of appearance in the source HTML document.

BREAKING CHANGE: The previous implementation treated id selectors the same way as class selectors (the `foo#a` selector could match `<foo id="a b">`). The new implementation is closer to the spec and doesn't expect multiple ids on an element. If you rely on the old behavior for some poorly formatted documents, you can get it back with the `foo[id~=a]` selector (note that it has a different specificity, though).
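
The difference between the two matching rules can be sketched in a few lines. This is only an illustration of the semantics described above: `matchesIdExact` and `matchesIdToken` are hypothetical helper names and are not part of html-to-text or its selector packages.

```javascript
// Illustrative sketch only: both helpers are hypothetical, not library code.

/** New behavior for `foo#a`: the whole id attribute must equal the id. */
function matchesIdExact (idAttribute, id) {
  return idAttribute === id;
}

/**
 * Old behavior, and what `foo[id~=a]` does:
 * the attribute is treated as a whitespace-separated list of tokens.
 */
function matchesIdToken (idAttribute, id) {
  return String(idAttribute).split(/\s+/).filter(Boolean).includes(id);
}

console.log(matchesIdExact('a b', 'a')); // false
console.log(matchesIdToken('a b', 'a')); // true
console.log(matchesIdExact('a', 'a'));   // true
```

For a well-formed document with a single id per element both rules give the same answer; they only diverge on attributes such as `id="a b"`.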

### Batch processing

Since options preprocessing is getting more involved with selector compilation, it seemed reasonable to split the single `htmlToText()` function into compilation and conversion steps. This may provide some performance benefit in client code.

* new function `compile(options)` returns a function of a single argument (html string);
* `htmlToText(html, options)` is now an alias to `convert(html, options)` function and works as before.
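
The relationship between the two functions listed above can be sketched as follows. This is a dependency-free stand-in, not the library's implementation: `preprocessOptions`, the `uppercase` option and the regex-based tag stripping are all assumptions made up for the illustration. With the real package you would import `compile` and `convert` from `html-to-text` instead.

```javascript
// Hypothetical stand-in for illustration only; NOT the html-to-text code.
let preprocessCount = 0;

// Stands in for the (potentially expensive) options preprocessing step.
function preprocessOptions (options) {
  preprocessCount++;
  return { uppercase: false, ...options };
}

// Preprocess the options once, return a function of a single html string.
function compile (options = {}) {
  const prepared = preprocessOptions(options);
  return function (html) {
    const text = html.replace(/<[^>]*>/g, '').trim(); // crude, sketch only
    return prepared.uppercase ? text.toUpperCase() : text;
  };
}

// One-off conversion: compile and immediately apply.
function convert (html, options) {
  return compile(options)(html);
}

const documents = ['<p>first</p>', '<p>second</p>', '<p>third</p>'];
const convertDocument = compile({ uppercase: true });
const texts = documents.map(html => convertDocument(html));

console.log(texts);           // [ 'FIRST', 'SECOND', 'THIRD' ]
console.log(preprocessCount); // 1
```

The point of the split is visible in the counter: three documents are converted, but the options are preprocessed only once.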

### Deprecated options

* `baseElement`;
* `returnDomByDefault`;
* `tables`;
* `tags`.

Refer to README for [migration instructions](https://github.com/html-to-text/node-html-to-text#deprecated-or-removed-options).

Nothing previously deprecated is removed in this version. Significant cleanup is planned for version 9 instead.

## Version 7.1.1

Regenerate `package-lock.json`.
171 changes: 113 additions & 58 deletions README.md

Large diffs are not rendered by default.

7 changes: 6 additions & 1 deletion lib/block-text-builder.js
@@ -1,4 +1,7 @@

+ // eslint-disable-next-line no-unused-vars
+ const { Picker } = require('selderee');
+
const { trimCharacter } = require('./helper');
// eslint-disable-next-line no-unused-vars
const { StackItem, BlockStackItem, TableCellStackItem, TableRowStackItem, TableStackItem, TransformerStackItem }
@@ -21,9 +24,11 @@ class BlockTextBuilder {
* Creates an instance of BlockTextBuilder.
*
* @param { Options } options HtmlToText options.
+ * @param { Picker<DomNode, TagDefinition> } picker Selectors decision tree picker.
*/
- constructor (options) {
+ constructor (options, picker) {
this.options = options;
+ this.picker = picker;
this.whitepaceProcessor = new WhitespaceProcessor(options);
/** @type { StackItem } */
this._stackItem = new BlockStackItem(options);
2 changes: 1 addition & 1 deletion lib/formatter.js
@@ -326,7 +326,7 @@ function formatDataTable (elem, walk, builder, formatOptions) {
function walkTable (elem) {
if (elem.type !== 'tag') { return; }

- const formatHeaderCell = (formatOptions.uppercaseHeaderCells)
+ const formatHeaderCell = (formatOptions.uppercaseHeaderCells !== false)
? (cellNode) => {
builder.pushWordTransform(str => str.toUpperCase());
formatCell(cellNode);
53 changes: 29 additions & 24 deletions lib/helper.js
@@ -1,27 +1,5 @@

- /**
- * Split given tag selector into it's components.
- * Only element name, class names and ID names are supported.
- *
- * @param { string } selector Tag selector ("tag.class#id" etc).
- * @returns { { classes: string[], element: string, ids: string[] } }
- */
- function splitSelector (selector) {
- function getParams (re, string) {
- const captures = [];
- let found;
- while ((found = re.exec(string)) !== null) {
- captures.push(found[1]);
- }
- return captures;
- }
-
- return {
- classes: getParams(/\.([\d\w-]*)/g, selector),
- element: /(^\w*)/g.exec(selector)[1],
- ids: getParams(/#([\d\w-]*)/g, selector)
- };
- }
+ const merge = require('deepmerge');

/**
* Given a list of class and ID selectors (prefixed with '.' and '#'),
@@ -160,13 +138,40 @@ function set (obj, path, value) {
obj[valueKey] = value;
}

+ /**
+ * Deduplicate an array by a given key callback.
+ * Item properties are merged recursively and with the preference for last defined values.
+ * Of items with the same key, merged item takes the place of the last item,
+ * others are omitted.
+ *
+ * @param { any[] } items An array to deduplicate.
+ * @param { (x: any) => string } getKey Callback to get a value that distinguishes unique items.
+ * @returns { any[] }
+ */
+ function mergeDuplicatesPreferLast (items, getKey) {
+ const map = new Map();
+ for (let i = items.length; i-- > 0;) {
+ const item = items[i];
+ const key = getKey(item);
+ map.set(
+ key,
+ (map.has(key))
+ ? merge(item, map.get(key), { arrayMerge: overwriteMerge })
+ : item
+ );
+ }
+ return [...map.values()].reverse();
+ }
+
+ const overwriteMerge = (acc, src, options) => [...src];
+
module.exports = {
get: get,
limitedDepthRecursive: limitedDepthRecursive,
+ mergeDuplicatesPreferLast: mergeDuplicatesPreferLast,
numberToLetterSequence: numberToLetterSequence,
numberToRoman: numberToRoman,
set: set,
splitClassesAndIds: splitClassesAndIds,
- splitSelector: splitSelector,
trimCharacter: trimCharacter
};
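
A note on the `mergeDuplicatesPreferLast` helper added in this hunk: its behavior is easiest to see on a small input. The sketch below is a dependency-free approximation. `mergePreferSecond` is a hypothetical stand-in for the `deepmerge` call with the `overwriteMerge` array strategy, and the item shape (`selector`, `options`, `format`) is made up for the example.

```javascript
const isPlainObject = (value) =>
  value !== null && typeof value === 'object' && !Array.isArray(value);

// Hypothetical stand-in for deepmerge: values from `second` win,
// plain objects are merged recursively, arrays are overwritten.
function mergePreferSecond (first, second) {
  const result = { ...first };
  for (const [key, value] of Object.entries(second)) {
    if (isPlainObject(value) && isPlainObject(result[key])) {
      result[key] = mergePreferSecond(result[key], value);
    } else {
      result[key] = Array.isArray(value) ? [...value] : value;
    }
  }
  return result;
}

// Same control flow as the helper in the diff, with the stand-in merge.
function mergeDuplicatesPreferLast (items, getKey) {
  const map = new Map();
  // Walk backwards so the last occurrence of each key fixes its position.
  for (let i = items.length; i-- > 0;) {
    const item = items[i];
    const key = getKey(item);
    map.set(key, map.has(key) ? mergePreferSecond(item, map.get(key)) : item);
  }
  return [...map.values()].reverse();
}

const merged = mergeDuplicatesPreferLast(
  [
    { selector: 'a', options: { x: 1 } },
    { selector: 'b' },
    { selector: 'a', options: { y: 2 }, format: 'anchor' }
  ],
  item => item.selector
);

console.log(JSON.stringify(merged));
// [{"selector":"b"},{"selector":"a","options":{"x":1,"y":2},"format":"anchor"}]
```

The two `'a'` entries collapse into one merged item that sits where the last `'a'` was, so `'b'` now comes first.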