Use meeseeks_html5ever instead of html5ever_elixir #2

Closed
mischov opened this issue Apr 4, 2017 · 0 comments


Because html5ever_elixir was designed with Floki's use in mind, it parses strings of HTML into the tuple-tree format that Floki.parse outputs.

meeseeks_html5ever is designed for Meeseeks and parses strings of HTML directly into the Meeseeks.Document that Meeseeks.parse outputs, which removes the need to build a document from the tuple-tree that html5ever_elixir returns.

History

Floki parses HTML into tuple-trees, uses those tuple-trees to create a flat-map structure like Meeseeks.Document as part of its selection process, and then converts any matching nodes back into tuple-trees.

Meeseeks takes a different approach. Meeseeks.parse returns a Meeseeks.Document which can be used in the selection process without any additional work. Selection itself returns Meeseeks.Results, which are pointers into the Meeseeks.Document that was queried.
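To make the two representations concrete, here is a minimal sketch in Python (used purely for illustration; Meeseeks itself is Elixir). The (tag, attributes, children) shape mirrors the tuple-tree format described above. The FlatDocument class, its field names, and select_by_tag are hypothetical stand-ins, not the real Meeseeks.Document or Meeseeks.Result structures.

```python
from dataclasses import dataclass, field

# A tuple-tree node is either a text string or (tag, attributes, children).
tuple_tree = ("div", [("class", "main")], [
    ("p", [], ["Hello"]),
    ("p", [], ["World"]),
])


@dataclass
class FlatDocument:
    """Hypothetical flat-map document: every node is stored under an integer id."""
    nodes: dict = field(default_factory=dict)
    next_id: int = 1

    def add(self, node, parent=None):
        """Insert a tuple-tree node and its descendants, returning the new id."""
        node_id = self.next_id
        self.next_id += 1
        if isinstance(node, str):
            self.nodes[node_id] = {"type": "text", "content": node, "parent": parent}
            return node_id
        tag, attrs, children = node
        entry = {"type": "element", "tag": tag, "attrs": dict(attrs),
                 "parent": parent, "children": []}
        self.nodes[node_id] = entry
        for child in children:
            entry["children"].append(self.add(child, parent=node_id))
        return node_id


def select_by_tag(doc, tag):
    """Return matching node ids: pointers into the document, not copied subtrees."""
    return [node_id for node_id, node in doc.nodes.items()
            if node["type"] == "element" and node["tag"] == tag]


doc = FlatDocument()
root_id = doc.add(tuple_tree)
paragraph_ids = select_by_tag(doc, "p")
```

The selection returns only ids, so nothing is converted back into a tuple-tree unless the caller asks for it.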

I chose this approach because it doesn't delay the building of the Meeseeks.Document, which has to happen anyway, and because it doesn't jump the gun in creating a tuple-tree for all results when the user might not need the whole tuple-tree.

Problem

Converting a tuple-tree into a Meeseeks.Document has a significant performance cost, especially for large documents. The cost can actually be larger than building the Meeseeks.Document alone would account for (GC cost, maybe?).

Solution

I created a custom html5ever NIF, meeseeks_html5ever, that parses HTML directly into a Meeseeks.Document.

Performance is always a bit tricky to judge, but I've seen around a 30% increase in parse speed for large documents when using meeseeks_html5ever instead of html5ever_elixir.
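The structural difference between the two paths can be sketched as follows, again in illustrative Python rather than the actual Rust NIF or Elixir code. The event list is a simplified, hypothetical stand-in for what an HTML parser emits, and all function names are assumptions. The sketch only shows that the two-step path builds and then re-walks an intermediate tree; it does not reproduce the measured speedup.

```python
# Hypothetical, simplified parser events standing in for a real HTML parser.
events = [
    ("open", "div"), ("open", "p"), ("text", "Hello"), ("close", "p"), ("close", "div"),
]


def build_tuple_tree(events):
    """Path 1, step 1: build a nested (tag, attributes, children) tuple-tree."""
    stack = [("root", [], [])]
    for kind, value in events:
        if kind == "open":
            stack.append((value, [], []))
        elif kind == "text":
            stack[-1][2].append(value)
        else:  # "close": attach the finished node to its parent
            node = stack.pop()
            stack[-1][2].append(node)
    return stack[0][2][0]


def flatten(tree):
    """Path 1, step 2: walk the whole tuple-tree again to build the flat map."""
    nodes = {}

    def walk(node, parent):
        node_id = len(nodes) + 1
        if isinstance(node, str):
            nodes[node_id] = {"text": node, "parent": parent}
        else:
            nodes[node_id] = {"tag": node[0], "parent": parent}
            for child in node[2]:
                walk(child, node_id)

    walk(tree, None)
    return nodes


def build_flat_directly(events):
    """Path 2: fill the flat map while parsing, with no intermediate tree."""
    nodes, open_ids = {}, []
    for kind, value in events:
        parent = open_ids[-1] if open_ids else None
        if kind == "open":
            node_id = len(nodes) + 1
            nodes[node_id] = {"tag": value, "parent": parent}
            open_ids.append(node_id)
        elif kind == "text":
            nodes[len(nodes) + 1] = {"text": value, "parent": parent}
        else:  # "close": the current element is finished
            open_ids.pop()
    return nodes


two_step = flatten(build_tuple_tree(events))
one_step = build_flat_directly(events)
```

Both paths end with the same flat map, but the direct path never allocates the nested tuples that the two-step path creates and then discards.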

Implementation

  • The Meeseeks API permits a tuple-tree to be provided to Meeseeks.parse, so that case still has to be supported by moving the Meeseeks.Document.new functionality into a tuple-tree specific parser
  • meeseeks_html5ever builds a slightly different Meeseeks.Document than Meeseeks.Document.new does, so the differences need to be accounted for
  • The Document.Doctype field "type" has been renamed to "name"
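
The first point amounts to dispatching on the kind of input. A minimal sketch in illustrative Python, assuming a tuple-tree arrives as a single tuple; every function name here is hypothetical and none of this is the actual Meeseeks code:

```python
def parse_html_string(html):
    """Hypothetical stand-in for the NIF-backed parser that returns a document directly."""
    return {"source": "html5ever", "input": html}


def parse_tuple_tree(tree):
    """Hypothetical stand-in for the tuple-tree specific parser."""
    return {"source": "tuple_tree", "input": tree}


def parse(source):
    """Dispatch on input type so both kinds of input keep working."""
    if isinstance(source, str):
        return parse_html_string(source)
    if isinstance(source, tuple):
        return parse_tuple_tree(source)
    raise TypeError(f"unsupported input: {type(source).__name__}")


from_string = parse("<p>Hello</p>")
from_tree = parse(("p", [], ["Hello"]))
```

Keeping the tuple-tree path in its own parser leaves the string path free to go straight through the NIF.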
mischov added a commit that referenced this issue Apr 8, 2017
Merge pull request #3 from mischov/meeseeks-html5ever
Replace html5ever_elixir with meeseeks_html5ever (#2)

@mischov mischov closed this Apr 8, 2017
