Replace cheerio with native API #3677

Open
curbengh opened this issue Aug 18, 2019 · 16 comments


@curbengh
Contributor

commented Aug 18, 2019

Feature Request

From the benchmark (performed by @SukkaW) we can see that cheerio can cause a significant performance hit.

While cheerio is not necessary for the core functions of Hexo, it is currently utilized by several default plugins:

We are hoping to re-implement them without depending on cheerio, if possible.

Related PRs:

Related issue:

@curbengh

Contributor Author

commented Aug 18, 2019

Can this be part of the roadmap?
@NoahDragon

@curbengh curbengh referenced this issue Aug 18, 2019
1 of 1 task complete
@SukkaW

Member

commented Aug 18, 2019

Cheerio uses parse5 now; it used htmlparser2 before.

Here is a benchmark comparing Node.js HTML parsers: https://travis-ci.org/AndreasMadsen/htmlparser-benchmark

image

@curbengh

Contributor Author

commented Aug 18, 2019

That means cheerio@1.x is slower than cheerio@0.x. (Edit: not necessarily, going by the benchmark.)

After a glance through htmlparser2, I wonder if we can use it directly instead of via cheerio; not sure whether that would bring any perf benefit.

@SukkaW

Member

commented Aug 18, 2019

@curbengh Hexo is currently using cheerio@0.22.
Maybe we should first test the performance of hexo with cheerio@1.x, to see whether it is actually slower.

Update

Here is the result:

Test A

latest hexo master branch, with meta_generator commented out.

image

Test B

sukkaw/hexo#bump-cheerio-1.x using github:curbengh/cheerio#decode-test, with meta_generator commented out.

image

cheerio@1.0.0-rc3 is even slightly faster than cheerio@0.22.

@curbengh

Contributor Author

commented Aug 19, 2019

Interesting, perhaps cheerio@1.x includes some optimizations that somehow offset the latency of parse5. Judging from cheeriojs/dom-serializer#85, cheerio@1.x dropped the decodeEntities and xmlMode options; maybe that's why it's faster.

This also means parse5 may not be that slow, so we can consider it in addition to htmlparser2.

@segayuu

Contributor

commented Aug 19, 2019

The DOMParser package parses quickly and follows the web standards draft.
However, it may be inconvenient because it is a pure parser.

@curbengh

Contributor Author

commented Aug 19, 2019

DOMParser looks promising; too bad it can't do any manipulation.

Just realized DOMParser could be used in open_graph, since that helper only needs to parse, and in unit tests.

Edit: Unit tests are usually tiny (<5 posts), so the improvement might not be noticeable.

@curbengh curbengh changed the title Replace cheerio with native API Replace cheerio with string manipulation Aug 19, 2019
@curbengh curbengh changed the title Replace cheerio with string manipulation Replace cheerio with native API Aug 19, 2019
@curbengh

Contributor Author

commented Aug 19, 2019

I just tried re-implementing toc using dom-parser. Somehow parentNode, which I wanted as a replacement for $.parent(), doesn't work.

const DomParser = require('dom-parser')
const source = `<span class="john">hey<h1>aaa</h1><h2>bbb</h2><h3>ccc</h3></span>`
const dom = new DomParser().parseFromString(source)
const heading = dom.getElementsByTagName('h1')
console.log(heading[0].parentNode)
// null

const cheerio = require('cheerio')
const $ = cheerio.load(source)
console.log($('h1').parent().attr('class'))
// john
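
One possible workaround when a parser does not populate parent references is to attach them in a single pass after parsing. The sketch below uses plain objects as stand-ins for parser output; the node shape (`nodeName`, `className`, `childNodes`) and the helper name `linkParents` are assumptions for illustration, and whether dom-parser nodes can be annotated this way has not been verified.

```javascript
// Walk the tree once and store a reference to each node's parent.
// Works on any tree whose nodes expose a `childNodes` array (assumed shape).
function linkParents(node, parent = null) {
  node.parentNode = parent;
  for (const child of node.childNodes || []) {
    linkParents(child, node);
  }
  return node;
}

// Hand-built stand-in for <span class="john"><h1/><h2/></span>.
const tree = linkParents({
  nodeName: 'span',
  className: 'john',
  childNodes: [
    { nodeName: 'h1', childNodes: [] },
    { nodeName: 'h2', childNodes: [] }
  ]
});

console.log(tree.childNodes[0].parentNode.className);
// john
```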
@SukkaW

Member

commented Aug 19, 2019

cheerio is easy to use. I think its problem is that it becomes very slow when it needs to load huge content (for small content cheerio takes about 11ms, for huge content it might take 65ms).
We could try using a regex to extract a small fragment, then let cheerio load and manipulate only that fragment.
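
A minimal sketch of the "regex first, parser second" idea, assuming headings are the fragments of interest and that they are not nested inside one another. The function name `extractHeadingFragments` is made up for illustration; each returned string could then be handed to cheerio instead of the whole page. Note that the fragments carry no information about their surrounding elements.

```javascript
// Matches one complete heading element, <h1>...</h1> through <h6>...</h6>.
// The backreference \1 makes sure the closing tag has the same level.
const HEADING = /<h([1-6])\b[^>]*>[\s\S]*?<\/h\1>/gi;

// Returns every heading element in `html` as a small standalone string.
function extractHeadingFragments(html) {
  return html.match(HEADING) || [];
}

const page = '<div><p>intro</p><h1 id="a">One</h1><p>body</p><h2>Two <em>b</em></h2></div>';
console.log(extractHeadingFragments(page));
// [ '<h1 id="a">One</h1>', '<h2>Two <em>b</em></h2>' ]
```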

@SukkaW

Member

commented Aug 22, 2019

To optimize the toc() helper, I started by requiring cheerio conditionally and returning before cheerio.load: sukkaw/hexo:lazy-cheerio-for-toc. But performance dropped by 5%, so I stopped.

@curbengh did you work out the dom-parser problem?
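
For readers unfamiliar with the pattern, here is a rough sketch of a conditional require with an early return. It is not the code from the branch mentioned above. The loader is injected so the sketch runs without cheerio installed; in practice `loadParser` would be something like `() => require('cheerio')`, and `createLazyToc` is a hypothetical name.

```javascript
// Builds a toc-like function that only loads its parser on first real use.
function createLazyToc(loadParser) {
  let parse = null; // cached after the first load
  return function toc(html) {
    // Cheap pre-check: return early when there is no heading to process.
    if (!/<h[1-6][\s>]/i.test(html)) return '';
    if (parse === null) parse = loadParser();
    return parse(html);
  };
}

// Usage with a stand-in loader that counts how often it runs.
let loads = 0;
const toc = createLazyToc(() => {
  loads++;
  return (html) => `toc for ${html.length} chars`;
});

toc('<p>no headings here</p>'); // returns '' without loading anything
toc('<h1>Title</h1>');          // first heading triggers the load
toc('<h2>Another</h2>');        // reuses the cached parser
console.log(loads);
// 1
```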

@curbengh

Contributor Author

commented Aug 23, 2019

Nope, I'm still far off.

@SukkaW

Member

commented Aug 23, 2019

The toc() helper needs parentNode, which makes it nearly impossible to replace cheerio with regex.

@segayuu

Contributor

commented Aug 23, 2019

Regular expressions are designed to search from the left, but finding a parent efficiently requires searching from the right.
Is there any way other than using String#lastIndexOf() and the RegExp sticky flag?
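
To make the question concrete, here is a rough sketch of a right-to-left search that combines `String#lastIndexOf()` with a sticky regex. It assumes well-formed markup, no tags inside comments or attribute values, and only a partial list of void elements; `findParentTag` is a hypothetical helper, not something Hexo provides.

```javascript
// Sticky regex: matches exactly one tag starting at `lastIndex`.
const TAG = /<(\/?)([a-zA-Z][\w-]*)[^>]*?(\/?)>/y;
// Partial list of elements that never have a closing tag.
const VOID = new Set(['br', 'hr', 'img', 'input', 'meta', 'link']);

// Scans leftwards from `childStart` for the nearest still-open tag.
function findParentTag(html, childStart) {
  if (childStart <= 0) return null;
  let depth = 0; // closing tags seen whose openers are still to the left
  let pos = html.lastIndexOf('<', childStart - 1);
  while (pos !== -1) {
    TAG.lastIndex = pos;
    const m = TAG.exec(html);
    if (m) {
      const [, closing, name, selfClosing] = m;
      if (closing) {
        depth++; // an earlier sibling ends here; skip its opener
      } else if (!selfClosing && !VOID.has(name.toLowerCase())) {
        if (depth === 0) return { name, index: pos };
        depth--;
      }
    }
    pos = pos === 0 ? -1 : html.lastIndexOf('<', pos - 1);
  }
  return null;
}

const source = '<span class="john">hey<h1>aaa</h1><h2>bbb</h2><h3>ccc</h3></span>';
console.log(findParentTag(source, source.indexOf('<h3>')));
// { name: 'span', index: 0 }
```

Every step still parses a tag left to right; only the jump between candidate positions goes right to left, which is why this stays a workaround rather than a true reverse search.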

@SukkaW

Member

commented Aug 23, 2019

html-dom-parser, which is a wrapper around htmlparser2, can turn a DOM tree into JSON, and it supports parent.

@curbengh Have a look?

@SukkaW

Member

commented Aug 27, 2019

So far I have come up with this: a piece of code that flattens the html-dom-parser output.

const parser = require('html-dom-parser');

const output = parser('<p>Hello, <span>world!</span></p><span>123</span>');
console.log(output);

/*
[ { type: 'tag',
    name: 'p',
    attribs: {},
    children: [ [Object], [Object] ],
    next:
     { type: 'tag',
       name: 'span',
       attribs: {},
       children: [Array],
       next: null,
       prev: [Circular],
       parent: null },
    prev: null,
    parent: null },
  { type: 'tag',
    name: 'span',
    attribs: {},
    children: [ [Object] ],
    next: null,
    prev:
     { type: 'tag',
       name: 'p',
       attribs: {},
       children: [Array],
       next: [Circular],
       prev: null,
       parent: null },
    parent: null } ]
 */

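// Seed the result with the top-level nodes, keyed by index.
const result = {};

Object.assign(result, output);

// Append every descendant of `tag` to `result`.
// Note: a child's own descendants are added before the child itself.
const flatten = (tag) => {
  if (!tag.children || !tag.children.length) return;
  for (const child of tag.children) {
    if (child.children) {
      flatten(child);
    }
    result[Object.keys(result).length] = child;
  }
};

for (const tag of output) {
  flatten(tag);
}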

console.log(result);

/*
{ '0':
   { type: 'tag',
     name: 'p',
     attribs: {},
     children: [ [Object], [Object] ],
     next:
      { type: 'tag',
        name: 'span',
        attribs: {},
        children: [Array],
        next: null,
        prev: [Circular],
        parent: null },
     prev: null,
     parent: null },
  '1':
   { type: 'tag',
     name: 'span',
     attribs: {},
     children: [ [Object] ],
     next: null,
     prev:
      { type: 'tag',
        name: 'p',
        attribs: {},
        children: [Array],
        next: [Circular],
        prev: null,
        parent: null },
     parent: null },
  '2':
   { data: 'Hello, ',
     type: 'text',
     next:
      { type: 'tag',
        name: 'span',
        attribs: {},
        children: [Array],
        next: null,
        prev: [Circular],
        parent: [Object] },
     prev: null,
     parent:
      { type: 'tag',
        name: 'p',
        attribs: {},
        children: [Array],
        next: [Object],
        prev: null,
        parent: null } },
  '3':
   { data: 'world!',
     type: 'text',
     next: null,
     prev: null,
     parent:
      { type: 'tag',
        name: 'span',
        attribs: {},
        children: [Array],
        next: null,
        prev: [Object],
        parent: [Object] } },
  '4':
   { type: 'tag',
     name: 'span',
     attribs: {},
     children: [ [Object] ],
     next: null,
     prev:
      { data: 'Hello, ',
        type: 'text',
        next: [Circular],
        prev: null,
        parent: [Object] },
     parent:
      { type: 'tag',
        name: 'p',
        attribs: {},
        children: [Array],
        next: [Object],
        prev: null,
        parent: null } },
  '5':
   { data: '123',
     type: 'text',
     next: null,
     prev: null,
     parent:
      { type: 'tag',
        name: 'span',
        attribs: {},
        children: [Array],
        next: null,
        prev: [Object],
        parent: null } } }
 */
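
As a possible variation, the same flattening can be done in document order (each node before its descendants) and collected into an array instead of an index-keyed object. The sketch below runs on a hand-built tree that mimics the `type`, `name`, `data` and `children` fields shown above; it does not call html-dom-parser, and `flattenInOrder` is a made-up name.

```javascript
// Depth-first, pre-order flatten: a node is pushed before its children.
function flattenInOrder(nodes, out = []) {
  for (const node of nodes) {
    out.push(node);
    if (node.children) flattenInOrder(node.children, out);
  }
  return out;
}

// Stand-in for the parsed '<p>Hello, <span>world!</span></p><span>123</span>'.
const parsed = [
  {
    type: 'tag',
    name: 'p',
    children: [
      { type: 'text', data: 'Hello, ' },
      { type: 'tag', name: 'span', children: [{ type: 'text', data: 'world!' }] }
    ]
  },
  { type: 'tag', name: 'span', children: [{ type: 'text', data: '123' }] }
];

const flat = flattenInOrder(parsed);
console.log(flat.map((n) => (n.type === 'tag' ? n.name : n.data)));
// [ 'p', 'Hello, ', 'span', 'world!', 'span', '123' ]
```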
@curbengh

Contributor Author

commented Aug 28, 2019

I am indeed struggling with traversing levels of html-dom-parser output. I'll see how I can proceed. Feel free to work on toc.js, I may only do it this weekend. (edit: or next)
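
One way to reason about levels, assuming every node carries a `parent` reference as in the html-dom-parser output above, is to walk upwards and collect the ancestors. This is only a sketch on hand-built nodes; `ancestors` is a hypothetical helper.

```javascript
// Returns the chain of ancestors, nearest first, by following `parent`.
function ancestors(node) {
  const chain = [];
  for (let p = node.parent; p; p = p.parent) chain.push(p);
  return chain;
}

// Hand-built stand-in for <p><span>world!</span></p> with parent links.
const pNode = { type: 'tag', name: 'p', parent: null, children: [] };
const spanNode = { type: 'tag', name: 'span', parent: pNode, children: [] };
const textNode = { type: 'text', data: 'world!', parent: spanNode };
pNode.children.push(spanNode);
spanNode.children.push(textNode);

console.log(ancestors(textNode).map((n) => n.name));
// [ 'span', 'p' ]
console.log(ancestors(textNode).length); // nesting level of the text node
// 2
```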

@NoahDragon NoahDragon referenced this issue Sep 28, 2019
16 of 50 tasks complete
3 participants